METHOD AND SYSTEM OF DATA ANALYSIS
USING NEURAL NETWORKS



FIELD OF THE INVENTION 
The present invention relates generally to computer systems, and more
specifically, to using neural network applications to perform data mining and
data analysis.

BACKGROUND OF THE INVENTION 
Neural networks and neural network applications are known in the art.
Experiments in biological neural networks have determined that the strength of
synaptic connections between neurons in the brain is a function of the
frequency of excitation. Neurons are presented with numerous stimuli (input
signals, produced by some external action, such as the eye viewing an object,
or the skin sensing temperature). After sufficient exposure to sensorial stimuli
from an environment, a collection of neurons will start to react differently,
depending on the strength of the individual stimuli. One effect of this process is
that certain neurons, or collections of neurons, are more likely to fire when
presented with certain patterns rather than others. The same collection of
neurons is also sensitive to patterns that are fairly similar. This sensitivity can
over time be construed as 'learning' a certain part of an input space.

T. Kohonen has created one mathematical abstraction of the above-described
neural network process, known as the Kohonen algorithm, which is
discussed in detail in various writings. The Kohonen algorithm has been used
to build simple models of the cortex and has also been used in other
applications. However, present applications have not addressed all of the
needs related to computer implemented data analysis using neural network
models.



SUMMARY OF THE INVENTION

According to one embodiment of the invention, a method of computer
data analysis using neural networks is disclosed. The method includes
generating a data representation using a data set, the data set including a
plurality of attributes, wherein generating the data representation includes:
modifying the data set using a training algorithm, wherein the training algorithm
includes growing the data set; and performing convergence testing, wherein
convergence testing checks for convergence of the training algorithm, and
wherein the modifying of the data set is repeated until convergence of the
training algorithm occurs; and displaying one or more subsets of the data set
using the data representation. The data representation may include a latent
model. A latent model may include a simplified model of the original data or
data set, representing trends and other information which may not have been
present or accessible in the original data. This may be done by constructing a
new set of data vectors, initialized through a principal plane initialization, that
are adapted to become more similar to the original data. The original data may
not be changed.

According to another embodiment, a system for performing data analysis
using neural networks is disclosed. The system includes one or more
processors; one or more memories coupled to the one or more processors; and
program instructions stored in the one or more memories, the one or more
processors being operable to execute the program instructions, the program
instructions including: generating a data representation using a data set, the
data set including a plurality of attributes, wherein generating the data
representation includes: modifying the data set using a training algorithm,
wherein the training algorithm includes growing the data set; and performing
convergence testing, wherein convergence testing checks for convergence of
the training algorithm, and wherein the modifying of the data set is repeated
until convergence of the training algorithm occurs; and displaying one or more
subsets of the data set using the data representation.

According to yet another embodiment, a computer program product for
computer data analysis using neural networks is disclosed. The computer
program product includes computer-readable program code for generating a
data representation using a data set, the data set including a plurality of
attributes, wherein generating the data representation includes: modifying the
data set using a training algorithm, wherein the training algorithm includes
growing the data set; and performing convergence testing, wherein
convergence testing checks for convergence of the training algorithm, and
wherein the modifying of the data set is repeated until convergence of the
training algorithm occurs; and computer-readable program code for displaying
one or more subsets of the data set using the data representation.

According to yet another embodiment, an apparatus for performing data
analysis using neural networks is disclosed. The apparatus includes means for
representing a data set, the data set including a plurality of attributes; means for
generating the representation means using the data set, wherein generating the
representation means includes: modifying the data set using a training
algorithm, wherein the training algorithm includes growing the data set; and
performing convergence testing, wherein convergence testing checks for
convergence of the training algorithm, and wherein the modifying of the data set
is repeated until convergence of the training algorithm occurs; and means for
displaying one or more subsets of the data set using the modified data
representation.

According to one embodiment of the invention, a method of computer
data analysis using neural networks is disclosed. The method includes
generating a data set D, the data set including a plurality of attributes and a
plurality of data set nodes; initializing the data set, initializing the data set
including: calculating an autocorrelation matrix, K, over the input data set D,
where

$$K = \frac{1}{\mathrm{card}(D)} \sum_{\forall d \in D} d \cdot d^{T};$$

finding the two longest eigenvectors of K, $e_1$ and $e_2$, where
$|e_1| > |e_2|$; and initializing vector values of each element of a data
representation F by spanning it with element values of the eigenvectors;
generating a data representation using a training algorithm, wherein the training
algorithm includes growing the data set, growing the data set including: finding
$K_q$ for each of the data set nodes, where $K_q$ is the node with the highest
average quantization error, $\arg\max_q \{\bar{q}(t)_{K_q}\}$ for each of the
data set nodes, where $\bar{q}(t)_{K_q}$ is the average quantization error for
node q, where:

$$K_x = \arg\max\{\|K_q - K_{\langle r(q)-1,\,c(q)\rangle}\|,\ \|K_q - K_{\langle r(q)+1,\,c(q)\rangle}\|\}$$

$$K_y = \arg\max\{\|K_q - K_{\langle r(q),\,c(q)-1\rangle}\|,\ \|K_q - K_{\langle r(q),\,c(q)+1\rangle}\|\}$$

if $\|K_y - K_c\| < \|K_x - K_c\|$ then
    $n_r = r(y)$ if $r(y) < r(c)$, else $n_r = r(c)$; and
    $n_c = c(y)$;
else $n_r = r(y)$; $n_c = c(x)$ if $c(x) < c(c)$, else $n_c = c(c)$;
inserting a new row and column after row $n_r$ and column $n_c$; interpolating
new attribute values for the newly inserted node vectors using an interpolation
formula with $\alpha \in U(0,1)$; performing convergence testing, wherein
convergence testing checks for convergence of the training algorithm, and
wherein the training algorithm is repeated until convergence of the training
algorithm occurs; and displaying one or more subsets of the data set using the
data representation.

In one embodiment, performing convergence testing includes testing the
condition $q(t) < Q_e$. In another embodiment, the training algorithm further
includes:

    $t = t + 1$;
    $\forall d \in D$;
    if ($t < 50$ or afterGrow)
        $p_d = \arg\min_{\langle r,c \rangle} \|d - F_{\langle r,c \rangle}\|$
        afterGrow = false
    else
        $p_d$ = FindSCWS(d)
    call function: FindNeighborhoodPatterns(p)
    call function: BatchUpdateMatchVectors
    if (MayGrow(t) and $t < t_{\max}$), call function: GrowKF.

In another embodiment, a plurality of display and/or analysis features
may be included. A composite view may further include: constructing an
attribute matrix; and selecting a highest value for each attribute value from the
selected set of attributes. A range filter may be included to select regions on the
data representation and filter out nodes based on defined value ranges. A
zooming function may include: making a selection of nodes to form a base
reference of interest; defining a set of data records from a second data set;
matching the second data set to the data representation; flagging all records
that are linked to the matched region; and generating a second data
representation using the flagged records. Visual scaling may include changing
the minimum and maximum values used to calculate a colour progression used
to visualize at least one of the plurality of attributes, and re-interpolating the
active colour ranges over the new valid range of attribute values. A labelling
engine may include: linking attribute columns in an input file to attributes in the
data representation; selecting attributes from the input file to be used for
labelling; determining with which row and column each row in the input file is
associated; and placing labels on the data representation. An advanced search
function may be included to: read a set of data records from a data source; match
attribute columns from the set of data records to attributes in the data
representation; and display a list of all records that are associated with nodes
that are part of the active selection on the data representation.

It is to be understood that other aspects of the present invention will
become readily apparent to those skilled in the art from the following detailed
description where, simply by way of illustration, exemplary embodiments of the
invention are shown and described. As will be realized, the invention is capable
of other and different embodiments, and its several details are capable of
modifications in various respects, all without departing from the invention.
Accordingly, the drawings and description are to be regarded as illustrative in
nature and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS 

These and other features, aspects, and advantages of the present
invention will become better understood with regard to the following description
and accompanying drawings where:

FIG. 1 is an environment diagram of a data analysis system, in
accordance with an embodiment of the present invention.

FIG. 2 is a flow diagram of a data analysis process, in accordance with
an embodiment of the present invention.

FIG. 3 is an example screen shot of TfnnCompCol and a TfnnSMF of the
data analysis system, in accordance with an embodiment of the present
invention.

FIG. 4 is an example component map colour bar, in accordance with an
embodiment of the present invention.

FIG. 5 is an example composite filter showing the concurrent
visualization of multiple attributes, in accordance with an embodiment of the
present invention.

FIG. 6 is a composite filter in a binary attribute window, in accordance
with an embodiment of the present invention.

FIG. 7 is an example range filter interface screen shot, in accordance
with an embodiment of the present invention.

FIG. 8 is an example visualization image of an attribute that contains
outlier data, in accordance with an embodiment of the present invention.

FIG. 9 is an example visualization image with scaling applied, in
accordance with an embodiment of the present invention.

FIG. 10 is an example illustration of the binarisation process, in
accordance with an embodiment of the present invention.

FIG. 11 is a block diagram of an exemplary architecture for a general
purpose computer, in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION 

The detailed description set forth below in connection with the appended
drawings is intended as a description of exemplary embodiments of the present
invention and is not intended to represent the only embodiments in which the
present invention can be practiced. The term "exemplary" used throughout this
description means "serving as an example, instance, or illustration," and should
not necessarily be construed as preferred or advantageous over other
embodiments. The detailed description includes specific details for the purpose
of providing a thorough understanding of the present invention. However, it will
be apparent to those skilled in the art that the present invention may be
practiced without these specific details.

In the following description, reference is made to the accompanying
drawings, which form a part hereof, and through which is shown by way of
illustration specific embodiments in which the invention may be practiced. It is
to be understood that other embodiments may be used as structural and other
changes may be made without departing from the scope of the present
invention.

In accordance with one embodiment, the present invention includes a
data analysis system using a knowledge filter to visualize and analyze
high-dimensional data. Throughout this specification, the term "knowledge
filter" is used to identify an optimized representation of an input data set,
where the optimized representation is constructed during a training process. The
knowledge filter may also be referred to generally as the data representation. In
one exemplary embodiment, the training process uses unsupervised neural
networks. In another embodiment, the training process generates the
representation of the input data considering similarity. In one embodiment, in
general terms, the knowledge filter includes a number of coupled, or connected,
hexagons called nodes. Considering relevant attributes, two nodes that are
closer together are more similar than two nodes that are further apart. The
knowledge filter can be viewed for any particular attribute in the data set using
attribute window views. Using multiple attribute windows simultaneously, each
viewing a different attribute, provides for investigative and analytical abilities. In
one embodiment, the attribute window is a colour depiction of complex
multi-dimensional data in two dimensions. As each attribute is displayed in its
own window, the dynamics and interrelationships within the data may be
identified. The attribute depictions can provide insight and explain why and how
certain events, related to the input data set, occur. In another embodiment, the
attribute window may use grayscale depiction, or other format depiction of data
where differentiation between attributes can be made. While the included
drawings and figures are grayscale images, colour implementations may be
used.

In general terms, the underlying algorithms of the data analysis system train
and create a knowledge filter by allowing repetitive competition between nodes
for the right to "represent" records from the data set. Winning nodes influence
their neighbours, who to a lesser extent influence their own neighbours, and so
on. Guided by an innate desire to accurately represent the input data and code
dictating the magnitude and direction of its growth, the neural network learns
and matures to become an accurate representation of the input data, expressed
in a smaller and digestible space and shape that can be embodied as the
knowledge filter.

In accordance with one embodiment, the data analysis system is a data
mining and analysis tool, based on the self-organizing feature-mapping neural
network algorithm developed by T. Kohonen. In one embodiment, the system
constructs a mapping of high dimensional data onto a two dimensional plane.
The mapping is achieved through an iterative training process. The output of
the training process is a trained map that can be used for analysis and data
mining. One example training process is known as the self-organizing map
(SOM) algorithm.

The trained map can be used to deduce information embedded in the
input data that may not have been readily apparent to the user when viewing
the data in conventional formats. One desirable outcome of the trained map of
the present invention is the ability to do prediction on any element of a data
record, similar to the input space, that was not used for training. This is done by
finding the most similar data record in the trained map, and interpolating
attribute values over the found record, or over it and its neighbouring records.
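The prediction idea described above can be sketched as follows. This is a minimal illustration, not the system's actual prediction engine: it assumes a missing attribute is stored as `None`, that the trained map is available as a flat list of node vectors, and it copies values from the single most similar node rather than interpolating over neighbouring nodes. The name `predict_missing` is hypothetical.

```python
import math


def predict_missing(record, map_vectors):
    """Fill missing (None) attributes of `record` from the most similar node.

    Similarity is the Euclidean distance over the attributes that are
    present in the record (an assumption made for this sketch).
    """
    def distance(rec, node):
        return math.sqrt(sum((x - y) ** 2
                             for x, y in zip(rec, node) if x is not None))

    best = min(map_vectors, key=lambda node: distance(record, node))
    # Keep known values; take unknown ones from the best matching node.
    return [b if x is None else x for x, b in zip(record, best)]


trained_map = [[0.0, 0.0, 0.0], [1.0, 1.0, 5.0]]
print(predict_missing([0.9, 1.0, None], trained_map))
```

Averaging the predicted attribute over the best matching node and its grid neighbours would be one way to realise the "over it and its neighbouring records" variant.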

The following terminology will be used throughout the specification: A
data record/vector is a set of values describing attributes of a single occurrence
within an input domain; "the system" or "the data analysis system" is the data
analysis and data-mining tool; a map is a two-dimensional set of data records
produced as an output of the SOM algorithm; and the SOM algorithm is an
unsupervised neural network training algorithm.

The following concepts and symbols are used throughout the present
specification:

Symbol                        Meaning

$F$                           A knowledge filter.

$F_{\langle r,c \rangle}$     The data vector at row r, column c, in the
                              knowledge filter F.

$F_{\langle r,c \rangle,i}$   Element i of the data vector at row r, column c,
                              in the knowledge filter F.

$F_R$                         The number of rows in the knowledge filter F.

$F_C$                         The number of columns in the knowledge filter F.

$r(F_{\langle r,c \rangle})$  A function that extracts the value of r from
                              $F_{\langle r,c \rangle}$.

$c(F_{\langle r,c \rangle})$  A function that extracts the value of c from
                              $F_{\langle r,c \rangle}$.

$[a,b]$                       An enumerable list of nominal values, including
                              both a and b.

$\mathbf{a}$                  A data vector of the form
                              $[a_1, a_2, \ldots, a_n]$, where $\mathbf{a}$
                              contains n elements.

$T(a_i)$                      The function T returns a Boolean value indicating
                              whether element i of vector $\mathbf{a}$ is
                              missing or not.

$i_{\min}$                    The minimum value present in a data set for an
                              attribute i.

$i_{\max}$                    The maximum value present in a data set for an
                              attribute i.

$\mathrm{card}(d)$            A function returning the number of elements in
                              vector d.

$\|\mathbf{a}-\mathbf{b}\|$   Calculates the Euclidean norm between two data
                              vectors, a and b, for only those elements of a
                              and b that are not missing. Thus, where the
                              regular Euclidean norm is defined as
                              $\sqrt{\sum_i (a_i - b_i)^2}$, the norm used here
                              is defined only for elements $a_i$ and $b_i$ of a
                              and b that are not missing, i.e.
                              $\forall i,\ \neg T(a_i) \wedge \neg T(b_i)$. It
                              is assumed that both a and b contain an equal
                              number of elements. The symbol $\wedge$
                              represents a logical "AND" statement, and the
                              symbol $\neg$ represents a logical negation of a
                              statement. Thus, the statement $\neg T(a_i)$
                              indicates that element i of vector a must not be
                              missing.

$\{\cdot\}$                   Represents a list of vector values, such as
                              $\{a,b,c\}$. Implicit to this definition is an
                              addition operator $\oplus$, which appends an
                              element to the list. Therefore, the statement
                              $\{a,b\} \oplus c$ results in the list
                              $\{a,b,c\}$. A list of vector values can also be
                              represented by a barred capital bold letter, such
                              as $\bar{\mathbf{X}}$.
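The missing-value-aware norm defined above can be sketched in a few lines. The representation of a missing element as `None` and the function names `is_missing` and `masked_distance` are assumptions made for this illustration only.

```python
import math

MISSING = None  # assumption: a missing element is stored as None


def is_missing(value):
    """Plays the role of T(a_i): True when the element is missing."""
    return value is MISSING


def masked_distance(a, b):
    """Euclidean norm ||a - b|| over the elements present in both vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must contain an equal number of elements")
    total = 0.0
    for a_i, b_i in zip(a, b):
        # Only elements with (not T(a_i)) and (not T(b_i)) contribute.
        if not is_missing(a_i) and not is_missing(b_i):
            total += (a_i - b_i) ** 2
    return math.sqrt(total)


print(masked_distance([3.0, None, 0.0], [0.0, 5.0, 4.0]))
```

In the example, the second element is skipped because it is missing in the first vector, so only the first and third elements contribute to the distance.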

Referring now to FIG. 1, an environment diagram of the data analysis
system, in accordance with an embodiment of the present invention, is shown.
The data analysis system 100 receives data 102, which is the input data set,
and provides results 104 to the end user based on a processing of the received
data 102. In one embodiment, the data analysis system includes one or more
engines for processing the data 102. As used in this specification, an engine is,
for example, a computer program, application, process, function, or set of
computer executable commands that performs a function for other programs.
An engine can be a central or focal program in an operating system, subsystem,
or application program that coordinates the overall operation of other programs
and engines. An engine may also describe a special-purpose program that
contains one or more algorithms or uses rules of logic to derive an output. The
term "engine" is not limited to the above examples but is intended to inclusively
describe computer-executable programs. In the illustrated embodiment, the
data analysis system 100 includes a knowledge filter engine 106, a training
engine 108, a clustering engine 110, a visualization engine 112, a composite
view engine 114, a range filter engine 116, a zooming engine 118, a visual
scaling engine 120, a labelling engine 122, a search engine 124, and an equal
distance averaging (EDA) prediction engine 126. The composite view engine
114 may perform the composite viewing functions. The range filter engine 116
may perform the range filter functions. The zooming engine 118 may perform
the zooming functions. The visual scaling engine 120 may perform the visual
scaling functions. The labelling engine 122 may perform the labelling functions.
The search engine 124 may perform the advanced search functions. The EDA
prediction engine 126 may perform the EDA functions. The engines may be
included in any desired combination. It is not necessary that all of the engines
be used with the data analysis system 100. One or more engines may be
combined or work in conjunction with one another. For example, the knowledge
filter engine 106 may utilize functions of the training engine 108 to generate a
knowledge filter. Each of the engines may also be combined to perform
multiple processes or functions.

The data analysis system 100 may operate on any suitable general
purpose computer, computer system, server, or other suitable device capable of
running the described system and processes. The data analysis system may
be coupled to one or more databases for the storing of the input data set,
program instructions for the system and various engines, results, attribute
window views, and other functions of the system.

In one exemplary embodiment, the input data set may be provided to the
data analysis system in a predefined format. In one example embodiment, the
data analysis system receives a matrix of data, where the first row of data
contains two or more columns containing the names of the variables or
attributes. The second and subsequent rows contain the data records with a
value under each of the attribute names set out in the first row. Missing values
are denoted by blank or "?" entries, or any other indication of an empty entry.
In one embodiment, the data analysis system processes numerical values.
However, the data analysis system may also process other desired forms of
data.

Input data may be in a text file, delimited by tabs or in comma separated
value (CSV) format. Many existing, conventional systems for storing or
accessing data produce text files in a format that is suitable for the data analysis
system. Accordingly, the data analysis system may be used with existing data
and data systems. For example, the data analysis system may also receive
data stored in Microsoft Excel format, Access format, ASCII, text, and any other
suitable format.
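As an illustration of the input layout described above (a header row of attribute names followed by one record per row), the following sketch parses delimited text into attribute names and numeric records. It is not the system's loader: the function name `read_records`, the use of `None` for missing values, and the choice of a blank cell or "?" as the missing-value marker are assumptions for this example.

```python
import csv
import io


def read_records(text, delimiter=","):
    """Parse delimited text into (attribute names, list of records).

    The first row holds the attribute names; later rows hold numeric
    values. Blank cells and "?" are treated as missing (stored as None).
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip completely empty lines
    names = [name.strip() for name in rows[0]]
    records = []
    for row in rows[1:]:
        record = []
        for cell in row:
            cell = cell.strip()
            record.append(None if cell in ("", "?") else float(cell))
        records.append(record)
    return names, records


names, records = read_records("age,income\n30,1000\n,2000\n41,?\n")
print(names)
print(records)
```

Passing `delimiter="\t"` handles the tab-delimited variant in the same way.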

In one embodiment, a sufficient sampling of data is used to generate
results. For example, many statistical and regression techniques require only a
sample of the true underlying "population," or complete set of possible results.

In one exemplary embodiment, the input data may contain between 100
and 5,000 records having between 5 and 50 attributes. In another embodiment,
data having up to 20,000 records and up to 200 attributes may be processed.
However, any number of records having any number of attributes may be
processed by the data analysis system. The performance capabilities of the
particular computer or system being used, such as, for example, processing
speed, number of processors, amount of RAM and memory available to the
system, may determine the quantity of data that may be analyzed at any given
time.

The following description includes details regarding the training process,
the mathematics, evaluation criterion, and heuristic optimisations of the data
analysis system.

In one exemplary embodiment, the algorithm used by the system
includes three steps: (1) sampling a training pattern, (2) matching it to the map,
and (3) updating the map to more closely represent the input pattern.

One exemplary training algorithm is generally summarized as follows:

1. Initialization. Construct a grid of weight vectors. The initial weight vectors
   can be initialised randomly, or using an alternate initialisation scheme. It
   would however be useful to ensure that $w_j(0)$ is different for
   $j = 1, 2, \ldots, N$, where N is the number of neurons in the grid.

2. Sampling. Draw a sample x from the input space; x represents an input
   signal (i.e. a data record).

3. Similarity Matching. Find the neuron in the grid that is the most like x,
   using a minimum distance criterion, such as the Euclidean distance. The
   best matching neuron i(x) at time n is

   $$i(x) = \arg\min_j \|x(n) - w_j\|, \quad j = 1, 2, \ldots, N \qquad (1)$$

4. Updating. Adjust the synaptic weight vectors of all neurons, using the
   update formula

   $$w_j(n+1) = \begin{cases} w_j(n) + \eta(n)\,\pi_{j,i(x)}(n)\,[x(n) - w_j(n)], & j \in \Lambda_{i(x)}(n) \\ w_j(n), & \text{otherwise} \end{cases} \qquad (2)$$

   where $\eta(n)$ is the learning-rate parameter, and $\Lambda_{i(x)}(n)$ is the
   neighbourhood function centred around the winning neuron i(x); both
   $\eta(n)$ and $\Lambda_{i(x)}(n)$ vary dynamically for improved results.

5. Continuation. Repeat from step 2 until no noticeable changes in the
   weight vectors are observed.



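A choice for $\pi_{j,i(x)}$ is the Gaussian type function

$$\pi_{j,i(x)}(n) = \exp\left(-\frac{d_{j,i}^{2}}{2\sigma^{2}(n)}\right) \qquad (3)$$

where $d_{j,i}$ is the distance on the grid between neuron j and the winning
neuron i(x), and $\sigma$ is the effective width of the neighbourhood at a
specific time. It may be calculated, with a time constant $\tau_1$, as

$$\sigma(n) = \sigma_0 \exp\left(-\frac{n}{\tau_1}\right) \qquad (4)$$

The learning rate may also be decayed over time, with a time constant
$\tau_2$, using

$$\eta(n) = \eta_0 \exp\left(-\frac{n}{\tau_2}\right) \qquad (5)$$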
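The five training steps and the decay schedules of equations (3) to (5) can be sketched as a short pure-Python routine. This is an illustrative sketch under stated assumptions, not the system's training engine: it uses a rectangular grid with squared grid distance instead of a hexagonal layout, applies the Gaussian weight to every neuron rather than restricting the update to an explicit neighbourhood set, runs for a fixed number of steps instead of testing for "no noticeable changes", and the names `train_som`, `eta0`, `sigma0`, `tau1` and `tau2` are hypothetical.

```python
import math
import random


def train_som(data, rows, cols, n_steps, eta0=0.5, sigma0=1.0,
              tau1=None, tau2=None, seed=0):
    """Online SOM training on a rows x cols grid; returns {(r, c): weights}."""
    rng = random.Random(seed)
    dim = len(data[0])
    tau1 = tau1 or float(n_steps)  # assumed default time constants
    tau2 = tau2 or float(n_steps)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Step 1: random initial weight vector for every grid position.
    w = {pos: [rng.random() for _ in range(dim)] for pos in grid}
    for n in range(n_steps):
        x = rng.choice(data)                               # step 2: sampling
        # Step 3: best matching neuron by Euclidean distance, eq. (1).
        win = min(grid, key=lambda p: math.dist(x, w[p]))
        sigma = sigma0 * math.exp(-n / tau1)               # eq. (4)
        eta = eta0 * math.exp(-n / tau2)                   # eq. (5)
        for pos in grid:                                   # step 4: eq. (2)
            d2 = (pos[0] - win[0]) ** 2 + (pos[1] - win[1]) ** 2
            pi = math.exp(-d2 / (2.0 * sigma ** 2))        # eq. (3)
            w[pos] = [wj + eta * pi * (xj - wj)
                      for wj, xj in zip(w[pos], x)]
    return w                                               # step 5: fixed budget


weights = train_som([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]], 2, 3, 200)
print(len(weights))
```

Because every update is a convex combination of the old weight and the sample, weights initialised in [0, 1] stay in that range when the data is scaled to [0, 1].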



One measure used to evaluate the representation of the input space in
the trained map is the quantization error. Quantization error is defined as the
average distance between each training pattern and its corresponding best
matching unit. The quantization error for a single training pattern x is defined
as

$$d(x, w_c) = \min_j \{ d(x, w_j) \} \qquad (6)$$

where $d(x, w_j)$ represents the Euclidean distance between x and $w_j$,
$j = 1, 2, \ldots, N$, and c is the index of the best matching weight vector.
The global quantization error then is

$$\varepsilon_q = \frac{1}{P} \sum_{i=1}^{P} d(x_i, w_c) \qquad (7)$$

where P is the number of training patterns.
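As a small worked illustration of the quantization error, the sketch below computes the distance from each pattern to its best matching weight vector and averages those distances over all patterns. The function name `quantization_error` and the flat list-of-vectors representation of the map are assumptions for this example.

```python
import math


def quantization_error(patterns, weights):
    """Average distance between each pattern and its best matching unit."""
    total = 0.0
    for x in patterns:
        # Per-pattern error: distance to the closest weight vector.
        total += min(math.dist(x, w) for w in weights)
    # Global error: mean over the P training patterns.
    return total / len(patterns)


patterns = [[0.0, 0.0], [3.0, 4.0]]
weights = [[0.0, 0.0], [0.0, 4.0]]
print(quantization_error(patterns, weights))
```

Here the first pattern coincides with a weight vector (distance 0) and the second lies at distance 3 from its nearest weight vector, giving an average of 1.5.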

A BatchMap algorithm is an optimisation algorithm that may be used in
the data analysis system. The BatchMap algorithm may provide accelerated
training of self-organizing maps.

One exemplary version of the BatchMap algorithm is given as:

1. For the initial reference vectors, take, for instance, the first K training
   samples, where K is the number of reference vectors.

2. For each map unit i, collect a list of copies of all those training samples x
   whose nearest reference vector belongs to the topological
   neighbourhood set $N_i$ of unit i.

3. Take for each new reference vector the mean over the respective list.

4. Repeat from 2 a few times.



Another exemplary batch algorithm is as follows:

1. Initialise the model vectors $m_i$. (Any suitable initialisation scheme can
   be used.)

2. For each unit j, compute the average of the data vectors that the unit j is
   the best match for. Denote this average with $\bar{x}_j$.

3. Compute new values for the model vectors $m_i$ using the equation

   $$m_i = \frac{\sum_j n_j h_{ji} \bar{x}_j}{\sum_j n_j h_{ji}},$$

   where j goes through all the model vectors. The term $h_{ji}$ is the
   neighbourhood function of the SOM and $n_j$ is the number of data
   vectors that the unit j is the best match for.

4. Repeat steps 2 and 3 until convergence criteria are satisfied.
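One pass of steps 2 and 3 of the batch algorithm above can be sketched as follows. This is a hedged illustration: the Gaussian form of the neighbourhood function, the `positions` argument giving each unit's grid coordinates, the fallback of leaving a model vector unchanged when its denominator is zero, and the name `batch_step` are all assumptions of this sketch.

```python
import math


def batch_step(data, models, positions, sigma):
    """One batch update: weighted mean of per-unit averages of matched data."""
    k, dim = len(models), len(data[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k  # n_j: number of data vectors best matched by unit j
    for x in data:
        j = min(range(k), key=lambda u: math.dist(x, models[u]))
        counts[j] += 1
        sums[j] = [s + v for s, v in zip(sums[j], x)]
    new_models = []
    for i in range(k):
        num, den = [0.0] * dim, 0.0
        for j in range(k):
            if counts[j] == 0:
                continue
            mean_j = [s / counts[j] for s in sums[j]]  # average for unit j
            d2 = sum((p - q) ** 2 for p, q in zip(positions[i], positions[j]))
            h = math.exp(-d2 / (2.0 * sigma ** 2))     # assumed Gaussian h_ji
            weight = counts[j] * h
            num = [a + weight * m for a, m in zip(num, mean_j)]
            den += weight
        new_models.append([a / den for a in num] if den > 0 else list(models[i]))
    return new_models


data = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
print(batch_step(data, [[0.0, 1.0], [9.0, 9.0]], [(0, 0), (0, 1)], sigma=0.01))
```

With a very narrow neighbourhood, as in the example, each model vector simply moves to the mean of the data vectors it matches best; wider neighbourhoods blend in the averages of nearby units.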

An exemplary batch SOM algorithm is given as follows:

• Initialise weight vectors

• t = 0

• for epoch = 1 to N_epochs do

    • Interpolate new value for neighbourhood width, σ(t)

    • Initialise numerator and denominator of the weight update to 0

    • for record = 1 to N_record do

        i. t = t + 1

        ii. for k = 1 to K do

            1. Find best matching unit

        iii. for k = 1 to K do

            1. accumulate numerator and denominator of the weight update

    • for k = 1 to K do

        i. update weight vector using the accumulated numerator and
           denominator

FIG. 2 is a flow diagram of a data analysis process, in accordance with
an embodiment of the present invention. The process illustrated in the flow
diagram of FIG. 2 is one example of steps in the data analysis system. The
process may operate with fewer than the illustrated steps, with additional
steps, in other orders of operation, or with other desired modifications and
variations. Some of the listed steps include functions described in greater detail
in other sections of the specification. In step 200, the data input by the user is
formatted for input to the data analysis system. In step 205, the knowledge filter
is initialized. In step 210, the training algorithm is executed. In step 215, a
control variable t, which is used to control the operation of the training
algorithm, is set to zero (0). In step 220, the FindSCWS function is called to
determine the most similar matching node. In step 225, the
FindNeighborhoodPatterns function is called to find all nodes that fall within the
currently considered node's neighbourhood. In step 230, the
BatchUpdateMatchVectors function is called to update the feature vector.
In step 235, the GrowKF function is called. Using GrowKF, the size of the
knowledge filter is increased to allow it to better capture the input data space.
In step 240, a check is performed to determine if the algorithm has converged.
If the algorithm has converged, then the algorithm is stopped and the
knowledge filter is stored in memory for analysis, step 245. In step 250, if the
algorithm has not converged, the control variable t is incremented by one and
steps 220 through 240 are repeated. In step 255, analysis may be performed
using the stored knowledge filter. Analysis using the knowledge filter includes
performing EDA predictions, composite attribute viewing, performing range filter
analysis, visual scaling, individual labelling, advanced searching, zooming
functions, and other desired data analysis functions.
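The control flow of steps 215 through 250 can be sketched as a loop that takes the individual operations as callables. Only the ordering of the steps is taken from the description above; the callable signatures, the representation of the knowledge filter as an opaque value `kf`, the `t_max` safety cap, and the name `run_training` are hypothetical choices made for this sketch.

```python
def run_training(kf, data, find_scws, find_neighbourhood, batch_update,
                 grow, has_converged, t_max=1000):
    """Run the training loop of FIG. 2; returns (knowledge filter, final t)."""
    t = 0                                                 # step 215
    while t < t_max:
        matches = [find_scws(kf, d) for d in data]        # step 220
        patterns = find_neighbourhood(kf, data, matches)  # step 225
        kf = batch_update(kf, patterns)                   # step 230
        kf = grow(kf, t)                                  # step 235
        if has_converged(kf, data):                       # step 240
            break                                         # step 245: stop, keep kf
        t += 1                                            # step 250
    return kf, t


# Toy stand-ins: the "filter" is a counter that converges once it reaches 3.
result = run_training(
    kf=0,
    data=[1, 2],
    find_scws=lambda kf, d: d,
    find_neighbourhood=lambda kf, data, matches: matches,
    batch_update=lambda kf, patterns: kf + 1,
    grow=lambda kf, t: kf,
    has_converged=lambda kf, data: kf >= 3,
)
print(result)
```

Passing the operations in as parameters keeps the sketch independent of how the matching, neighbourhood, update and growth functions are actually implemented.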

In accordance with one embodiment, a more detailed description of the
knowledge filter creation process is provided below.

Since data can be read from different data sources, data used for
creating the knowledge filter is stored in memory in a generic container class
that is independent from the source of data. The generic container class,
referred to in the following sections as a training set, is a list of data vectors, D,
where $d_i$ is the ith vector in D, and $d_{i,j}$ is the jth element of vector i.

The input data may be subject to a data preparation process. In an
exemplary data preparation process, a data scaling process and a binarisation
process are performed on the training set. In an exemplary embodiment,
before the training algorithm commences, the complete training set D may be
pre-processed. Pre-processing may include a two-step process. In the first
step, each data vector in the training set is scaled to the range [0,1], and in
the second step, flagged attributes are binarised. In an exemplary data scaling
process, each element in each data vector in the training set D is replaced by a
scaled representation of itself. The scaling process thus entails:

$$\forall i \in [1, \mathrm{card}(d)],\ \forall d_j \in D: \quad d_{j,i} := \frac{d_{j,i} - i_{\min}}{i_{\max} - i_{\min}}$$

where $i_{\min}$ and $i_{\max}$ are the minimum and maximum values present in
the data set for attribute i.
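The scaling step can be sketched as a per-attribute min-max scaling that leaves missing values untouched. This is an illustration under assumptions: missing values are stored as `None`, an attribute whose minimum equals its maximum is mapped to 0.0 to avoid division by zero, and the name `scale_training_set` is hypothetical.

```python
def scale_training_set(training_set):
    """Scale every attribute of every data vector to the range [0, 1]."""
    n_attr = len(training_set[0])
    scaled = [list(d) for d in training_set]  # leave the input unchanged
    for i in range(n_attr):
        present = [d[i] for d in training_set if d[i] is not None]
        if not present:
            continue  # attribute is missing everywhere; nothing to scale
        lo, hi = min(present), max(present)
        span = hi - lo
        for d in scaled:
            if d[i] is not None:
                d[i] = (d[i] - lo) / span if span else 0.0
    return scaled


print(scale_training_set([[1.0, 10.0], [3.0, None], [5.0, 30.0]]))
```

The minimum and maximum are computed per attribute over the values that are present, so a missing entry neither affects the range nor gets replaced.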




In an exemplary binarisation process, the system automates converting
attributes, including non-numeric and non-scalar attributes, into one or more
toggled attribute values. The binarisation process is discussed below in greater
detail.

In one exemplary embodiment, the knowledge filter may have a
predefined structure. In one embodiment, the knowledge filter F consists of a
two-dimensional grid of positions, called nodes. Each node has an associated
row and column position. A specific node position is referenced through the
notation $F_{\langle r,c \rangle}$, where <r,c> indicates a specific row and
column position. Each node is considered to be a hexagon, implying that it is
adjacent to six other nodes. For a knowledge filter with $F_R$ rows and $F_C$
columns, nodes are arranged in the following fashion:

In an exemplary knowledge filter initialisation process, node values are
initialized through a variant of a technique called Principal Plane Initialization.
The initialization algorithm includes the following steps:

Calculate an autocorrelation matrix, K, over the input data set D, where

$$K = \frac{1}{\mathrm{card}(D)} \sum_{\forall d \in D} d \cdot d^{T}.$$

Note that $d \cdot d^{T}$ is a vector multiplication operator, and not the inner
product.

Find the two longest eigenvectors of K, $e_1$ and $e_2$, where
$|e_1| > |e_2|$.

Initialize the vector values of each element of the knowledge filter F by
spanning it with the element values of the eigenvectors. The following
initialization rules are used:

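4. $F_{\langle F_R, F_C \rangle} := e_2$

5. $\forall c \in [2, F_C - 1]$: $F_{\langle 1,c \rangle} := \frac{c}{F_C} F_{\langle 1,F_C \rangle} + \frac{F_C - c}{F_C} F_{\langle 1,1 \rangle}$

6. $\forall c \in [2, F_C - 1]$: $F_{\langle F_R,c \rangle} := \frac{c}{F_C} F_{\langle F_R,F_C \rangle} + \frac{F_C - c}{F_C} F_{\langle F_R,1 \rangle}$

7. $\forall r \in [2, F_R - 1]$: $F_{\langle r,1 \rangle} := \frac{r}{F_R} F_{\langle F_R,1 \rangle} + \frac{F_R - r}{F_R} F_{\langle 1,1 \rangle}$

8. $\forall r \in [2, F_R - 1]$: $F_{\langle r,F_C \rangle} := \frac{r}{F_R} F_{\langle F_R,F_C \rangle} + \frac{F_R - r}{F_R} F_{\langle 1,F_C \rangle}$

9. $\forall r \in [2, F_R - 1]$, $\forall c \in [2, F_C - 1]$: $F_{\langle r,c \rangle} := \frac{c}{F_C} F_{\langle r,F_C \rangle} + \frac{F_C - c}{F_C} F_{\langle r,1 \rangle}$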



A more detailed knowledge filter training algorithm, in accordance with an
embodiment of the invention, is described as follows.
The following constants may be user definable:

- $t_{max}$: The maximum number of training iterations.

- $Q_e$: A minimum quantization error threshold.

In addition to the above, the following symbols relevant to the training
process are defined:

- $t$: The current training step.

- $q(t)$: The current quantization error.

- afterGrow: Boolean value (initially set to false) indicating whether the
  algorithm performed a map growing step in the previous iteration.

For each training pattern $d$ in $D$, define a construct retaining the row
and column position of the node in the knowledge filter that
most closely represents the vector $d$. For training pattern $d$, $p_d$ represents
this position.

Convergence testing ascertains whether training should stop. This is
done by testing whether the conditions $q(t) < Q_e$ and $t < t_{max}$ hold.
An exemplary training algorithm is as follows:

1. Initialize knowledge filter $F$ from the data set $D$.

2. $t = 0$.

3. Perform the following steps, while the algorithm has not converged:

   a. $t = t + 1$

   b. $\forall d \in D$

      i. if ($t < 50$ or afterGrow)

         1. $p_d = \arg\min_{\langle r,c \rangle} \| d - F_{\langle r,c \rangle} \|$

         2. afterGrow = false

      ii. else

         1. $p_d$ = FindSCWS(d)

      iii. FindNeighborhoodPatterns(p)

      iv. BatchUpdateMatchVectors

      vi. if (MayGrow(t) and $t < t_{max}$)

         1. GrowKF

The above algorithm contains the following functions:

FindSCWS(d): Determines the most similar matching node in $F$ to $d$
using a technique that is computationally less expensive than iterating over all
the nodes in $F$. The technique works as follows:

1. For $d$, determine $p_d(t-1)$.

2. Build a list, $N_d$, of all the nodes neighbouring $F_{p_d}$.

3. If $\| d - F_{p_d} \| < \| d - n \|$, $\forall n \in N_d$, return $p_d$; else set
   $p_d = \arg\min_{n \in N_d} \| d - n \|$ and repeat from step 2.

FindNeighborhoodPatterns(p): Finds, for each node in the knowledge
filter, all the nodes that fall within its neighbourhood, using a currently defined
neighbourhood width. The neighbourhood width is a simple, linear function dependent
on the current step and the maximum number of steps. Thus the neighbourhood
width, at any time step $t$, is defined as $\eta(t) = 1 - \frac{t}{t_{max}}$. We also define
$w_{max} = \frac{F_R + F_C}{2}$. Each knowledge filter node also has a list of matched
positions, $K_{\langle r,c \rangle}$, associated with it.
The FindNeighborhoodPatterns(p) function then has the following effect:

Calculate $\eta(t)$.

1. $w = \lceil \eta(t)\, w_{max} \rceil$

2. $\forall d \in D$:

   a. $c_1 = \max\{c(p_d) - w, 0\}$

   b. $c_2 = \min\{c(p_d) + w, F_C\}$

   c. $r_1 = \max\{r(p_d) - w, 0\}$

   d. $r_2 = \min\{r(p_d) + w, F_R\}$

   e. $\forall r \in [r_1, r_2]$, $\forall c \in [c_1, c_2]$:

   f. $\delta = \sqrt{(r - r(p_d))^2 + (c - c(p_d))^2}$

   g. If ($\delta \le w$)

      i. Add $d$ to $K_{\langle r,c \rangle}$
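The neighbourhood search can be sketched in Python as follows. This is a minimal illustration, not the implementation of the described system: it assumes 0-based row and column indices, a linearly decaying width factor of the form 1 - t/t_max, and a maximum width equal to half the sum of the map dimensions. The names `neighbourhood_width`, `find_neighbourhood_patterns` and `bmu_positions` are hypothetical.

```python
import math

def neighbourhood_width(t, t_max, rows, cols):
    """Integer neighbourhood width at training step t (linear decay)."""
    eta = 1.0 - t / t_max          # width factor, 1 at t = 0 and 0 at t = t_max
    w_max = (rows + cols) / 2.0    # assumed maximum width
    return math.ceil(eta * w_max)

def find_neighbourhood_patterns(bmu_positions, rows, cols, w):
    """Collect, per node, the patterns whose best match lies within width w.

    bmu_positions maps a pattern index to the (row, col) of its best match.
    Returns a dict mapping every (row, col) node to a list of pattern indices.
    """
    matched = {(r, c): [] for r in range(rows) for c in range(cols)}
    for d, (pr, pc) in bmu_positions.items():
        # Only visit the bounding box around the best match, clipped to the map.
        for r in range(max(pr - w, 0), min(pr + w, rows - 1) + 1):
            for c in range(max(pc - w, 0), min(pc + w, cols - 1) + 1):
                if math.hypot(r - pr, c - pc) <= w:
                    matched[(r, c)].append(d)
    return matched
```

Restricting the inner loops to a clipped bounding box keeps the cost per pattern proportional to the neighbourhood area rather than to the whole map.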
GrowKF: Growing the knowledge filter increases its size to allow it to
better capture the input data space. The algorithm, when a growing step is
triggered, functions as follows:

1. Find $K_q$, the knowledge filter node with the highest average quantization
   error, i.e. $\arg\max_q \{\bar{q}(t)_{K_q}\}$ for all knowledge filter nodes, where
   $\bar{q}(t)_{K_q} = \frac{1}{t-1} \sum_{i=1}^{t-1} q(i)_{K_q}$ is the average quantization
   error for node $q$ over the previous training steps.

2. $K_y = \arg\max\{\| K_q - K_{\langle r(q)-1,\, c(q) \rangle} \|,\ \| K_q - K_{\langle r(q)+1,\, c(q) \rangle} \|\}$

3. $K_x = \arg\max\{\| K_q - K_{\langle r(q),\, c(q)-1 \rangle} \|,\ \| K_q - K_{\langle r(q),\, c(q)+1 \rangle} \|\}$

4. if $\| K_q - K_y \| < \| K_q - K_x \|$ then

   a. $n_r = r(y)$ if $r(y) < r(q)$, else $n_r = r(q)$

   b. $n_c = c(y)$

5. else

   a. $n_r = r(y)$

   b. $n_c = c(x)$ if $c(x) < c(q)$, else $n_c = c(q)$

6. Insert a new row and column after row $n_r$ and column $n_c$.

7. Interpolate new attribute values for the newly inserted node vectors, using
   an interpolation factor $\alpha \in U(0,1)$.

BatchUpdateMatchVectors: Updates the feature vector associated with
each knowledge filter position, based on all the data vectors that were matched
to it from $D$.

The algorithm functions as follows:

1. $v = 0.05$

2. $\forall r \in [1, F_R]$, $\forall c \in [1, F_C]$:

a. /,=0 

   b. $g = 0$

      i.

      ii. $g = g + d$

      iv. $K_{\langle r,c \rangle} = g + v\,h$

The following Shortcut Winner Search (SCWS) may accelerate the
training process in that it decreases the computational complexity of searching
for a best matching unit (BMU) associated with a particular training pattern.
After a number of training epochs, the map tends to become organized, i.e. the
sum of corrections made to the weight vector of a particular neuron in the map
is small. This implies that the BMU associated with a training pattern is likely
to be in the vicinity of the BMU of the same pattern at the previous epoch. SCWS
therefore tracks the position of the BMU associated with each training pattern
after each epoch. This position is then used to calculate the new BMU, starting
the search at the position of the BMU at the previous epoch.

Each unit not on the perimeter of the map is surrounded by six units.
SCWS evaluates the node indicated by the saved BMU, and all surrounding
neurons. If the saved BMU is still the BMU, no further evaluation is done. If one
of the six direct neighbour units is found to be a better match, the search is
repeated with the new best match as the centre node, and its six direct
neighbours are evaluated.

The SCWS algorithm can be summarised as follows:

1. Retrieve the BMU position calculated at a previous epoch.

2. Recalculate the distance to the BMU.

3. Calculate the distance to all direct neighbours.

   a. If the BMU found at a previous epoch is still the closest match to
      the training pattern, stop the search.

   b. Determine the closest perimeter unit, and make it the BMU.

   c. Repeat from step 3.
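The steps above can be sketched in Python as a local descent over the map. The sketch makes several assumptions that are not stated in the text: the hexagonal grid uses an "odd-row shifted" offset layout, distances are squared Euclidean distances, and `weights[r][c]` holds the weight vector of the unit at row `r`, column `c`. The function names are hypothetical.

```python
def hex_neighbours(r, c, rows, cols):
    """Up to six neighbours of (r, c) on a hexagonal grid (odd rows shifted right)."""
    if r % 2 == 0:
        steps = [(0, -1), (0, 1), (-1, -1), (-1, 0), (1, -1), (1, 0)]
    else:
        steps = [(0, -1), (0, 1), (-1, 0), (-1, 1), (1, 0), (1, 1)]
    return [(r + dr, c + dc) for dr, dc in steps
            if 0 <= r + dr < rows and 0 <= c + dc < cols]

def sq_dist(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def shortcut_winner_search(pattern, weights, prev_bmu):
    """Search for the best matching unit, starting at the previous epoch's BMU."""
    rows, cols = len(weights), len(weights[0])
    best = prev_bmu
    best_d = sq_dist(pattern, weights[best[0]][best[1]])
    while True:
        candidate, candidate_d = best, best_d
        for r, c in hex_neighbours(best[0], best[1], rows, cols):
            d = sq_dist(pattern, weights[r][c])
            if d < candidate_d:
                candidate, candidate_d = (r, c), d
        if candidate == best:
            # The centre unit is still the closest match: stop the search.
            return best
        # Otherwise continue from the closest neighbour.
        best, best_d = candidate, candidate_d
```

The loop terminates because the distance to the pattern strictly decreases each time the centre moves. Like any local search, it can return a local rather than a global best match, which is why the text only applies it once the map has become organized.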



In one exemplary embodiment, map initialisation is performed by an
algorithm referred to as the SLC initialisation technique. In one embodiment,
the SLC initialisation technique attempts to find a large enough hypercube to
cover all of the training patterns. To this end, the algorithm finds the four
extreme training patterns. This is done by first finding the two training patterns
with the largest inter-pattern Euclidean distance. A third pattern is then found
at the furthest point from these patterns, and then a fourth pattern is found, at
the furthest distance from the three patterns already identified. These patterns are
used to initialise the map neurons on the four corners of the map. All remaining
neurons are then initialised by interpolating weight values for each attribute
according to the values at the four corners of the map. Another example
initialisation technique is random initialisation.

An example map initialisation technique is given as follows:
Assume an $N \times N$ map, where $n_{x,y}$ designates the neuron at row $x$ and
column $y$, and $w_{x,y}$ designates the weight vector of the same neuron.

1. First select a pair of input patterns from the training set whose
   inter-pattern distance is the largest among all the patterns in the training
   set. The vector values are used to initialise the weights of the neurons on the
   lower left and upper right corners of the map respectively (i.e. $w_{N,1}$ and
   $w_{1,N}$). From the remaining patterns, the vector values of the training
   pattern furthest from the two patterns already selected are used to
   initialise the neuron on the upper left corner (i.e. $w_{1,1}$). The neuron on
   the lower right corner of the map is set to the coordinates of the pattern
   that is the farthest from the previously selected three patterns.

2. Weights of neurons on the four edges of the map can be initialised using
   the four following equations:

   $w_{1,j} = \frac{w_{1,N} - w_{1,1}}{N-1}(j-1) + w_{1,1}$ for $j = 2, \ldots, N-1$

   $w_{N,j} = \frac{w_{N,N} - w_{N,1}}{N-1}(j-1) + w_{N,1}$ for $j = 2, \ldots, N-1$

   $w_{i,1} = \frac{w_{N,1} - w_{1,1}}{N-1}(i-1) + w_{1,1}$ for $i = 2, \ldots, N-1$

   $w_{i,N} = \frac{w_{N,N} - w_{1,N}}{N-1}(i-1) + w_{1,N}$ for $i = 2, \ldots, N-1$   (3)

   Since two points form a line in the input space, the line is uniformly
   partitioned into $N-1$ segments, and the ending points of the segments are
   used to initialise the weights of the neurons.

3. The remaining neurons are initialised using a top to bottom, left to right
   parsing scheme. This is explained using the following pseudo code:

   For i from 2 to N-1

      For j from 2 to N-1
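The initialisation steps above can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements of the text: "furthest from the patterns already selected" is taken to mean the largest summed Euclidean distance, and interior neurons are filled by interpolating each row between its left and right edge values. The names `lerp`, `farthest_from` and `slc_initialise` are hypothetical, and indices are 0-based.

```python
import math

def lerp(a, b, k, n):
    """Point k (0-based) of n equally spaced points on the line from a to b."""
    return [x + (y - x) * k / (n - 1) for x, y in zip(a, b)]

def farthest_from(patterns, chosen):
    """Pattern with the largest summed distance to the chosen patterns (assumption)."""
    rest = [p for p in patterns if p not in chosen]
    return max(rest, key=lambda p: sum(math.dist(p, q) for q in chosen))

def slc_initialise(patterns, n):
    """Return an n x n grid of weight vectors spanned by four extreme patterns."""
    # Step 1: the two most distant patterns, then two further extreme patterns.
    pairs = [(p, q) for i, p in enumerate(patterns) for q in patterns[i + 1:]]
    lower_left, upper_right = max(pairs, key=lambda pq: math.dist(*pq))
    upper_left = farthest_from(patterns, [lower_left, upper_right])
    lower_right = farthest_from(patterns, [lower_left, upper_right, upper_left])
    w = [[None] * n for _ in range(n)]
    # Step 2: left and right edges, interpolated from top to bottom.
    for i in range(n):
        w[i][0] = lerp(upper_left, lower_left, i, n)
        w[i][n - 1] = lerp(upper_right, lower_right, i, n)
    # Steps 2 and 3: every row is interpolated from its left end to its right
    # end, which covers the top and bottom edges as well as the interior.
    for i in range(n):
        for j in range(1, n - 1):
            w[i][j] = lerp(w[i][0], w[i][n - 1], j, n)
    return w
```

The exhaustive pair search is quadratic in the number of patterns, so a practical implementation might sample the training set first.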



A Principal Plane initialisation process may be included. An additional
description of an exemplary principal plane initialisation process follows.
The Principal Plane initialisation process has O(n) complexity in the size of
the data set, and only a single pass over the data set is needed. An exemplary
algorithm is as follows:

1. Calculate the autocorrelation matrix (inverse covariance matrix) of the
   input data: $C_{xx} = \frac{1}{N} \sum_{t} x(t)\,x^{T}(t)$, where $S$ is the data set
   and the sum runs over the $N$ patterns $x(t)$ in $S$. $C_{xx}$ is a square matrix
   with dimensions equal to the dimensionality of $S$.

2. Find the two largest (longest) eigenvectors.

3. Initialise the neuron space by spanning it with the attribute values
   of the two eigenvectors.
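Steps 1 and 2 can be illustrated with a small pure-Python sketch. The text does not say how the eigenvectors are found; power iteration with deflation is used here only as one simple possibility. The sketch assumes the two leading eigenvalues are distinct and that the start vector is not orthogonal to the leading eigenvector. All function names are hypothetical.

```python
import math

def autocorrelation(data):
    """Average outer product of the data vectors (one pass over the data)."""
    n, dim = len(data), len(data[0])
    return [[sum(x[i] * x[j] for x in data) / n for j in range(dim)]
            for i in range(dim)]

def dominant_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    dim = len(matrix)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(dim))
                for i in range(dim))
    return value, v

def two_principal_eigenvectors(data):
    """Return ((l1, e1), (l2, e2)) for the two leading eigenpairs."""
    c = autocorrelation(data)
    l1, e1 = dominant_eigenpair(c)
    dim = len(c)
    # Deflation: remove the first component, then iterate again.
    deflated = [[c[i][j] - l1 * e1[i] * e1[j] for j in range(dim)]
                for i in range(dim)]
    l2, e2 = dominant_eigenpair(deflated)
    return (l1, e1), (l2, e2)
```

In practice a numerical library's symmetric eigensolver would normally replace the hand-written iteration.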

An additional description of the Map Growing process is as follows:

1. Initialise a small network.

2. Grow until optimal map size:

   a. Train for X pattern presentations.

   b. Find the map unit, wc, with the largest average quantization error.

   c. Find the furthest neighbour, wx in the x-dimension, and wy in the
      y-dimension.

   d. Wn = 0.5(wx + wc)α or Wn = 0.5(wy + wc)α, where α ∈ [0,1], and
      so for all units in the row and column.

   e. Stop growing when:

      i. Max map size is reached (#neurons <= #training patterns, or
         #neurons = β x #patterns, where β ∈ [0,1]).


      ii. Max quantization error for a neuron is less than a threshold.

      iii. Global map convergence has been reached.

3. Refine map through normal training.

During each epoch, the current neighbourhood width is calculated using
the following linear formula:

At epoch e:

    MaxWidth = (#rows + #columns) / 2;
    If (e < epoch_threshold)
        New_width = (1 - (e / epoch_threshold)) * MaxWidth * 0.8
    Else
        New_width = 0,

where New_width represents the new neighbourhood width for epoch e, and
epoch_threshold is a factor that is specified by the user. Its effect is in principle
to limit the size of the neighbourhood after a certain number of epochs have
transpired. It is also used to instil a linear decrease in neighbourhood size.
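The width formula translates directly into a small Python function. The function and parameter names are hypothetical; the constants follow the formula given for the neighbourhood width.

```python
def new_width(epoch, epoch_threshold, rows, cols):
    """Neighbourhood width for a given epoch (linear decay, then zero)."""
    max_width = (rows + cols) / 2
    if epoch < epoch_threshold:
        return (1 - epoch / epoch_threshold) * max_width * 0.8
    return 0.0
```

For a 10 x 10 map with `epoch_threshold = 100`, the width starts at 8 at epoch 0, is about 4 at epoch 50, and is 0 from epoch 100 onwards.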
The 'Gaus'-factor mentioned above is calculated as follows:
Information needed:

• Current neighbourhood width at epoch e

• The BMUp

• Coordinates of the map unit currently considered.

The Gaus factor is then calculated as:

exp(-(distance between BMUp and map unit)^2 / (2 x (current neighbourhood width)^2)).

This factor has the form of the normal Gaussian distribution function.
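A minimal Python sketch of the Gaus factor follows. It assumes the distance is the Euclidean distance between the (row, column) coordinates of the two units, and it treats a zero width as a neighbourhood containing only the BMU itself; both are assumptions of this example, as is the function name.

```python
import math

def gaus_factor(bmu, unit, width):
    """Gaussian neighbourhood factor between a BMU and a map unit.

    bmu and unit are (row, col) tuples; width is the current
    neighbourhood width.
    """
    if width <= 0:
        # Assumed limit case: only the BMU itself is influenced.
        return 1.0 if bmu == unit else 0.0
    dist_sq = (bmu[0] - unit[0]) ** 2 + (bmu[1] - unit[1]) ** 2
    return math.exp(-dist_sq / (2 * width ** 2))
```

The factor is 1 at the BMU and decays smoothly with distance, so units further from the BMU receive a smaller share of each update.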
The following concepts are used in the following pseudo code training
algorithm:

1. MatchLists: Associated with each neuron is a Boolean vector, called a
   MatchList. The MatchList dimension is equal to the number of patterns in
   DC. When a training pattern is found to be within the neighbourhood of a
   particular unit, its corresponding entry in the unit's MatchList is toggled. This
   is a fast and simple way to track patterns in the topological neighbourhood.

2. Randomisation of data set: Randomisation produces a list of vector indexes
   in a random order. For each pattern, a set of two random indexes is
   calculated and these vector indexes are swapped:

       srand((unsigned) time(&t));
       for (int p = 0; p < patCount; p++)
       {
           // Pick two random positions and swap the entries stored there.
           pat_idx1 = rand() % patCount;
           pat_idx2 = rand() % patCount;
           tmp = (*RandomPatterns)[pat_idx1];
           RandomPatterns->insert(pat_idx1, (*RandomPatterns)[pat_idx2]);
           RandomPatterns->insert(pat_idx2, tmp);
       }

   Note that BatchMap calculates a new weight vector as a mean value of all
   training patterns that were found to lie within the topological neighbourhood of
   the map unit. It therefore does not matter in which order patterns are presented
   to the training algorithm, and the above randomisation algorithm need not
   be used.

3. Factors and flags

   - WEIGHTCHANGES: Constant, representing a number of epochs. After
     every number of epochs, as specified in WEIGHTCHANGES, the map is
     grown, if possible.

   - AfterGrow: A flag value, indicating whether the map was grown during
     the previous epoch.

The training pseudo code algorithm is as follows:

For each epoch e,

1. Clear all MatchLists

2. For each pattern p in DC

   ■ Determine the previous BMUp

   ■ If (e < 50) or AfterGrow

      • Calculate BMUp using exhaustive searching over the
        complete map

      • Toggle AfterGrow

   ■ Else

      • Calculate BMUp using Shortcut Winner Search

   ■ Update BMUp

3. For each map unit

   ■ Determine all patterns in its topological neighbourhood

   ■ Calculate a new weight vector as the mean over all the training
     patterns in its MatchList

4. Calculate the quantization error eq over all p in DC.

5. If (e % WEIGHTCHANGES) == 0

   ■ Grow the neuron map

   ■ Toggle AfterGrow

The following exemplary software classes may be used: 

BMUXY: Represents the row and column position of the Best Matching Unit
associated with a training pattern.

Bvector: Vector of Boolean values.

Data: Base class used to wrap a data set. Do not use directly. Derive a class
from this base class. (Note: This base class is not abstract.)

DataReader: Wrapper class to read data that was written in a binary format to
disk by the DataWriter class. There is no relationship between this class and
the Data class and its derivatives. The DataReader wraps reading primitive
types from disk, as well as complex types such as vectors and matrices.

DataWriter: Wrapper class to serialize information in a binary format. There is
no relationship between the DataWriter class and the Data class and its
derivatives. DataWriter wraps writing primitive types and complex types such as
vectors and matrices.

Dmatrix: A class representing matrix values of type double.

DVector: Vector of double values.

Imatrix: A class representing matrix values of type int.

IndicatorStats: A struct containing values calculated during each clustering
step.

IVector: Vector of int values.

LabelData: List of labels that may be shown on a map. The actual data vector of
each map position is kept, as well as a string that is displayed.

LabelListEntry: Wrapper class containing a label's relative position on a map,
the label caption, and a VCL TLabel instance.

MapContainer: Wraps a trained SOM map that was written to a .smb file. It
contains the map vectors, as well as statistics generated about the map, such as
the U-matrix, quantization error information, frequency information, component
map colour indexes, cluster information, and all colouring information.

NeuronMap: Wrapper class for a grid of SOMNeuron instances. Contains methods to
update neurons as well as grow the map.

PatternList: Used by the UnsupervisedData class to maintain a list of the
current set of training patterns managed by the class.

PosEntry: Represents a single position in the neuron map. Used by the SCWS
algorithm in the NeuronMap.

PosList: Linked list of positions in a neuron map.

RecallData: Wraps data read from a file, that is to be used for recall, or the
predict function.

SmartMatrix: A matrix of double values, with the added capability to perform
functionality related specifically to clustering.

SOM: Wrapper class for a Self-Organizing map. Maintains a NeuronMap, as well as
members that take care of all the training issues.

SOMNeuron: A single neuron in a SOM. Maintains a weight vector, as well as
links to all matching patterns in the training set.

TcurrPos: Manages a position in a neuron map.

TfrmClusInd: Form. Displays the calculated cluster indicators, and allows the
user to change the number of clusters that is to be displayed. Also allows the
user to switch between flat and shaded clusters.

TfrmCompCol: Shows a single representation of the SOM, such as clusters,
quantization error, frequency, U-Matrix or component values. Handles the
selection of positions on the map, and triggers updates to all other TfrmCompCol
instances that are visible. Handles the selection of a value range on the
component map colour bar.

TfrmComponents: Shows a summary of a data set before training commences. Allows
the user to change basic training parameters that will influence the map
training process. Also allows the user to cancel the training process. Spawns a
training thread, and shows the training progress in a graph format.

TfrmDatViewer: Shows a grid containing data patterns that were read from a data
file, on which recall, or the predict function, is to be done.

TfrmGetLabel: Allows a user to enter information about a label that is to be
displayed on a SOM map. The user can also change the font information.

TfrmMain: Main application form. This form is an MDIContainer/MDIParent form.
All other forms are children of this form. Handles, other than the default
windows processing, updates/changes in the displayed information.

TfrmMDIChildrenList: Shows a list of all the windows that are currently
displayed. The user can then elect to close some of these windows.

TfrmPathSettings: Allows the user to change settings that will enable/disable
neurons available for selection when drawing a path.

TfrmPredict: Aids the recall, or prediction, process. Allows specification of
input and output files, and viewing of data that are to be recalled.

TprefSettings: Allows the user to change preferences.

TfrmSelectComponents: Displays a list of component windows that may be selected,
and indicates to the user which windows are currently shown. The user can then
decide what to display, and what not.

TfrmSMF: Uses a tree view to display the structure of the information that may
be represented by the SOM. This information includes the U-Matrix, clusters, all
the possible components, frequency and quantization error information. Also
allows the user, in addition to double clicking on an entry in the tree view, to
select to show individual components, or to show all the components in the map.

TfrmSplash: Splash screen and about box.

TfrmStats: Shows statistics on the map. Statistics may be shown for the complete
map, a selection, a neighbourhood, a cluster or a single node.

TfrmWhat2Save: When saving a bitmap of a displayed map to file, the user can
choose to save the map as it is shown (with labels, selection, etc.) or only
save the basic map.

TImageInfoContainer: Works with the TfrmCompCol class. Wraps a bitmap that is
displayed by the TfrmCompCol, and maintains map information that can be used by
processing methods of a TfrmCompCol instance.

TimeSeriesData: Inherits its basic traits from the RecallData class. Wraps
functionality to manage a list of sequential positions on the SOM, and manages
tracking the current position in the list.

TIndicators: Wraps a list of TIndicatorStats structs. Introduces methods to
calculate the indicators.

TprogressWindow: Generic window that is used in several places to show the
progress of a specific task.

TSOMTrainingThread: Works with an instance of the SOM class. Handles the
complete training of a SOM. This is done in a separate thread, outside of the
main application message processing loop, to avoid having to perform constant
hard-coded GUI updates in the Borland API.

UnsupervisedData: Wraps all the data in the training set of a SOM. Inherits
basic traits from the Data class.

VecNode: Represents a single cluster in the SOM. Used while calculating
clusters.

VectorList: Linked list managing a list of VecNode instances.

In one embodiment of the invention, clustering may be used. Clustering
within the context of the data analysis system may serve two purposes: (1)
cluster membership of map units may be used when prediction is done. When
predicting attribute values using a neighbourhood, only map units within the
same cluster as the best matching unit are utilized to calculate a weighted
mean. Cluster membership implies similarity, and without correct cluster
information, prediction may be inaccurate; and (2) a graphical map showing
clusters may be constructed. Aside from measures that may be calculated to
find the best clustering, clusters shown should confirm knowledge about the
data, such as is the case in classification problems.

The following section describes theoretical and implementation details for
the classical Ward clustering algorithm, and a SOM-Ward algorithm, utilizing
map specific topological information to construct clusters.

Ward clustering follows a bottom-up approach. The algorithm places
each data unit considered for clustering in its own cluster. An iteration of the
algorithm identifies two clusters, which are then merged. This process is
repeated until the desired number of clusters has been constructed.
Identification of clusters for merging is done using the Ward distance, discussed
below.

Ward clustering is characterized by the following variance criterion: the
algorithm has as its goal to produce clusters with small variance over their
members, and large variance between clusters. Therefore, at each iteration, the
clusters merged are those that will contribute the least to the global variance
criterion, which increases at each step.

The distance measure is called the Ward distance, and is defined as:

$d_{rs} := \frac{n_r \cdot n_s}{n_r + n_s} \cdot \| x_r - x_s \|^2$   (8)

Two clusters are denoted by $r$ and $s$, $n_r$ and $n_s$ denote the number of data
points in the clusters, and $x_r$ and $x_s$ denote the mean over the cluster member
vectors.

The number of data points and the mean vector of the cluster are
updated as:

$x_r^{(new)} := \frac{1}{n_r + n_s} \cdot (n_r \cdot x_r + n_s \cdot x_s)$   (9)

$n_r^{(new)} := n_r + n_s$   (10)

This update is analogous to recalculating the centre of gravity of a set of point
masses. Here, the coordinate vector of a point mass in an arbitrary space is
represented by $x$ and its point mass by $n$.

One example Ward clustering approach is as follows:

Repeat until the desired number of clusters has been reached:

a. Find the two clusters with minimal Ward distance, as characterized by
   equation (8).

b. Update the new cluster, using equations (9) and (10).

c. Update the number of clusters.
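The loop above can be sketched in Python as follows. The sketch assumes the usual form of the Ward distance, (n_r · n_s / (n_r + n_s)) · ||x_r - x_s||², and represents every cluster as a (count, mean vector) pair. It recomputes all pairwise distances in every iteration, so it illustrates the procedure rather than an optimised implementation. The function names are hypothetical.

```python
def ward_distance(n_r, x_r, n_s, x_s):
    """Ward distance between two clusters given their sizes and mean vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x_r, x_s))
    return n_r * n_s / (n_r + n_s) * sq

def merge(n_r, x_r, n_s, x_s):
    """Size and mean of the merged cluster (weighted centre of gravity)."""
    n = n_r + n_s
    return n, [(n_r * a + n_s * b) / n for a, b in zip(x_r, x_s)]

def ward_clustering(points, target):
    """Merge clusters bottom-up until only `target` clusters remain."""
    clusters = [(1, list(p)) for p in points]   # every point starts alone
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = ward_distance(*clusters[i], *clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = merge(*clusters[i], *clusters[j])
        del clusters[j]          # j > i, so index i stays valid
        clusters[i] = merged
    return clusters
```

For the one-dimensional points 0, 1, 10 and 11 and a target of two clusters, the sketch returns two clusters of size 2 with means 0.5 and 10.5.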

SOM-Ward clustering is similar to Ward clustering but adds a heuristic to
ensure that all nodes belonging to a cluster are topologically related (i.e. they
lie next to each other on the map). This can be achieved by biasing the
calculation of the Ward distance between nodes, and accordingly between clusters.
Equation (8), describing the Ward distance between clusters r and s, can be
redefined as:

$d'_{rs} := \begin{cases} \infty, & \text{if } r \text{ and } s \text{ are not adjacent} \\ d_{rs}, & \text{otherwise} \end{cases}$   (11)

As the above algorithm always searches for two clusters with minimal Ward
distance, it follows that any two clusters with an inter-cluster distance of
$\infty$ will not be considered to be merged into one cluster. The result of
equation (11) can be regarded as the SOM-Ward distance. Further references in
this section to the Ward distance may be regarded to be the same as references
to the SOM-Ward distance.
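A minimal Python sketch of the SOM-Ward bias follows. It assumes that two clusters count as adjacent when at least one map unit of the first is a direct neighbour of a unit of the second; the text does not define cluster adjacency more precisely. The adjacency test is passed in as a function, and all names are hypothetical.

```python
import math

def som_ward_distance(ward_d, members_r, members_s, adjacent):
    """Return the Ward distance for adjacent clusters, infinity otherwise.

    members_r and members_s are lists of map positions; adjacent(u, v)
    tells whether two positions are direct neighbours on the map.
    """
    for u in members_r:
        for v in members_s:
            if adjacent(u, v):
                return ward_d
    return math.inf
```

Because the merge step always picks the minimum distance, an infinite value prevents two non-adjacent clusters from being merged while any adjacent pair remains.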

Several issues come into play when considering the implementation of
the above clustering algorithms. One consideration, aside from producing good
clusters, is to optimise the clustering process, as it can be computationally
expensive.

Ward clustering does not consider the topological locality of map units when
calculating clusters. Map units are therefore merged solely based on their
representative attribute values. Each map unit is initially regarded as a cluster.

In order to find two clusters with minimal Ward distance, the inter-cluster
distances for all clusters have to be calculated, and then searched for the
minimum. One way to do this is to construct a distance matrix of all the
inter-cluster distances over all clusters. The distance matrix is constructed
such that row and column indexes are significant. Such a matrix may be upper or
lower triangular. This detail does not matter, as long as the same convention is
used throughout the implementation. Equation (12) shows a 4x4 lower triangular
matrix:

$\begin{pmatrix} 0 & - & - & - \\ a_{21} & 0 & - & - \\ a_{31} & a_{32} & 0 & - \\ a_{41} & a_{42} & a_{43} & 0 \end{pmatrix}$   (12)

Here, $a_{21}$ indicates the distance between cluster 2 and cluster 1. The diagonal
of the matrix contains all zeroes to indicate that they represent a distance that is
not of interest. Values on the diagonal are never considered, and are therefore
insignificant. This is not to be confused with discussions regarding optimisation,
where a distance of zero will become significant.

Map sizes are often very large, and sizes of 20x20 units are common.
Following from the above discussion, such a map would require a distance
matrix with dimensions of 400x400. Although this is not really expensive in
terms of memory, it would take considerable time to process. Also note that as
soon as this matrix has been calculated and the minimum inter-cluster distance
has been found, the matrix needs to be recalculated.

It is possible to avoid calculating a matrix, and parse the list of clusters
linearly, searching for the minimum distance. As in the above matrix
calculations, large numbers of calculations that have already been made would
be repeated. This can be avoided if one considers that the only distances that
will change in the distance matrix would be those relating to the clusters that
were merged. As an example, consider a matrix as was shown in equation (12).
This matrix represents inter-cluster distances for 4 clusters. If the distance
matrix was to be processed, and $a_{32}$ found to contain the smallest entry, it
would indicate that clusters 3 and 2 would be merged. If these two clusters are
merged, the total number of clusters would decrease from 4 to 3. This
change needs to be reflected in the distance matrix, and can be achieved, for
4 clusters with clusters 2 and 3 to be merged, by:

- deleting row and column 3; and

- recalculating all distances in row and column 2.

This will result in a new matrix:

$\begin{pmatrix} 0 & - & - \\ a_{21} & 0 & - \\ a_{31} & a_{32} & 0 \end{pmatrix}$   (13)

The above can be formalized as the following heuristic for updating and
maintaining a distance matrix in a consistent format:

If a new cluster is to be constructed from clusters a and b:

- the new cluster index would be whichever is the smallest of a and b;

- all remaining cluster indexes higher than the largest of a and b are decreased
  by a single step;

- the complete row and column at position b, respectively, are removed; and

- the complete row and column at position a are updated to reflect the
  recalculated distances to the new cluster.

This realises the cluster-indexing scheme, where the index range changes from
1...k to 1...(k-1).

Calculation of the Ward distance may be adapted using the following
equation:

$d_{rs} := \begin{cases} 0, & \text{if } n_r = 0 \text{ or } n_s = 0 \\ \frac{n_r \cdot n_s}{n_r + n_s} \cdot \| x_r - x_s \|^2, & \text{otherwise} \end{cases}$   (14)

Note that $r$ and $s$ represent two clusters, and $n_r$ and $n_s$ represent the number of
input data patterns that map to clusters $r$ and $s$ respectively. This adaptation is
necessary to cater for the situation where there are no input data patterns
mapping to a particular cluster. In large maps (and even in very small maps,
depending on the data set) this situation is common. If this adaptation were not
taken into account, the Ward distance would not be calculable.
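The adapted distance can be written as a short Python function. The function name is hypothetical; cluster means are passed as lists of floats.

```python
def adapted_ward_distance(n_r, x_r, n_s, x_s):
    """Ward distance that is defined as 0 when either cluster is empty."""
    if n_r == 0 or n_s == 0:
        # Without this guard, two empty clusters would give 0 / 0.
        return 0.0
    sq = sum((a - b) ** 2 for a, b in zip(x_r, x_s))
    return n_r * n_s / (n_r + n_s) * sq
```

The guard matters most when both counts are zero: the unguarded expression would then divide by zero.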

By calculating the Ward distance using the above method, several
entries in the distance matrix will be zero. Merging of these "empty" clusters will
continue until no empty clusters exist. Note that if a single "empty" cluster
exists, there will be several zero entries in the distance matrix. As a heuristic
solution to the problem of deciding which clusters to merge, the clusters that are
the closest, using the Euclidean norm, are merged.

The data analysis system calculates, for each of the last 50 clusters
found when doing clustering, an indicator that serves as an evaluation of the
clustering. This can be regarded as a "goodness" measure of a set of clusters.
Indicators are calculated using the minimum Ward distance for each clustering.
A ratio is calculated between a set of two clusterings (e.g. between the
minimum Ward distance for 20 and 19 clusters). The ratio is then normalized
using a process discussed below.

For each c clusters:

- the symbol c represents the current number of clusters; and

- the function d(c) represents the minimal Ward distance for merging c into
  c-1 clusters.

The exact formulas to calculate the indicator $I(c)$ for $c$ clusters are:

$I(c) := \max(0, I'(c)) \cdot 100$   (15)

where

$I'(c) := \frac{\mu(c)}{\mu(c+1)} - 1$   (16)

$\mu(c)$ is defined as:

$\mu(c) := d(c) \cdot c^{\beta}$   (17)

$-\beta$ is the linear regression coefficient for the points $(\gamma, \delta)$, and

_^-£±z5 (18) 

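where $\gamma := \ln(c)$ and $\delta := \ln(d(c))$, and $r$ is the correlation coefficient
between $\gamma$ and $\delta$. The correlation coefficient is defined as

$r = \frac{\sum_{i=1}^{M} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{M} (x_i - \bar{x})^2 \sum_{i=1}^{M} (y_i - \bar{y})^2}}$   (19)

where $x$ and $y$ represent two correlated data points. Equation (19) can be
simplified for calculation. The simplified version, using symbols from our
regression, is:

$r = \frac{M \sum_{i=1}^{M} \gamma_i \delta_i - \sum_{i=1}^{M} \gamma_i \sum_{i=1}^{M} \delta_i}{\sqrt{\left(M \sum_{i=1}^{M} \gamma_i^2 - \left(\sum_{i=1}^{M} \gamma_i\right)^2\right)\left(M \sum_{i=1}^{M} \delta_i^2 - \left(\sum_{i=1}^{M} \delta_i\right)^2\right)}}$   (20)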

In accordance with one exemplary embodiment, cluster shading may be
used. Cluster shading is a technique that shades the colour of individual nodes
in a cluster according to the distance of those nodes from the centre of
gravity of the cluster. In order to do the colour adaptation, the centre of
gravity of each cluster has to be calculated. The centre of gravity for cluster
k is calculated from the neurons that make up cluster k, taking into account,
for each node x, the number n_x of data patterns that have node x as their best
matching unit. The furthest point from the centre of gravity in the feature
vector space of the cluster also needs to be identified, to be able to scale
relative distances from the centre of gravity to all the neurons in the
cluster. The Euclidean distance between the centre of gravity and this furthest
point in cluster k is calculated to determine a scaling factor for each neuron
in the cluster, which is then used to calculate the intensity of the associated
neuron colour. This associated colour is calculated as follows:

For each cluster:

1. Determine the current cluster colour from a predetermined collection of
predefined colours.

2. For each neuron in the current cluster:

a. For each neuron n, with a feature vector x_n, a distance factor is
calculated from x_n and the centre of gravity of the cluster. (21)

b. A scaling factor, s_n, is calculated from the distance factor and the
distance to the furthest point in the cluster.

c. s_n is adapted with a factor to enlarge the intensity adaptation that
will follow, as s_n = s_n × α, where α is a constant.

d. A copy, C_n, is made of the cluster colour.

e. C_n is decomposed into the three base colours, red, green and blue.

f. Each of the individual colours is then adapted using the scaling
factor s_n. By decreasing the individual base colours, a gradual decrease in
colour intensity can be achieved.

g. The individual colours are combined into a single colour identifier, C',
by shifting each colour component to its proper position, and OR'ing them
together.

C' is then the colour used to draw an individual neuron on a cluster map of a
problem domain.
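
The shading steps above can be sketched in a few lines. This is only an illustrative sketch: the function name `shade_colour`, the 0xRRGGBB packing, the default value of `alpha`, and the choice to darken each channel by multiplying it with (1 - s_n) are all assumptions, since the exact formulas are not legible in the text.

```python
def shade_colour(cluster_colour: int, distance: float,
                 max_distance: float, alpha: float = 0.5) -> int:
    """Return a darkened 0xRRGGBB copy of cluster_colour for one neuron.

    distance     -- distance of the neuron from the cluster's centre of gravity
    max_distance -- distance of the furthest neuron in the same cluster
    alpha        -- constant that enlarges the intensity adaptation (assumed)
    """
    # Scaling factor: relative distance, enlarged by alpha, clamped to [0, 1].
    s = 0.0 if max_distance == 0 else (distance / max_distance) * alpha
    s = min(max(s, 0.0), 1.0)

    # Decompose the copied cluster colour into its three base colours.
    red = (cluster_colour >> 16) & 0xFF
    green = (cluster_colour >> 8) & 0xFF
    blue = cluster_colour & 0xFF

    # Decrease each base colour (assumed rule: proportional darkening).
    red, green, blue = (int(channel * (1.0 - s)) for channel in (red, green, blue))

    # Shift each component to its position and OR them together.
    return (red << 16) | (green << 8) | blue
```

A neuron at the centre of gravity keeps the unmodified cluster colour, and neurons further away become progressively darker.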

In accordance with another embodiment, map visualisation may be used.
Map visualisation may be performed using the TfrmCompCol class, as
described in the above class definition section. The structure of the data
analysis system is such that several instances of the TfrmCompCol class can
be active as MDIChildren within the TfrmMain MDIParent window. Any running
instance should be able to dispatch GUI updates on an ad hoc basis. To this
end, a TfrmCompCol instance will be linked to a TfrmSMF instance, and the
TfrmSMF will act as a message server. All GUI updates that are not relative to
a specific map will be sent to the TfrmCompCol's associated TfrmSMF, which
will broker the requested update as necessary. FIG. 3 shows an example of a
TfrmSMF (caption: ReactorSpecs) and a number of component windows. Any
change or GUI request that affects every component window is sent to the
TfrmSMF instance, which in turn updates all the visible component windows,
thus allowing for some degree of optimisation. Compared to the processing
overhead needed to determine map associations from sources other than a
rigidly maintained data structure with direct links to all related windows
(such as getting a list of open windows using the WinAPI, determining from the
list which open windows are available, and which of these are part of the
currently shown map), this design choice provides a desirable outcome.
An example of functionality originating from a component map that is not
brokered by the TfrmSMF is the update of the position within the map that is
shown on the status bar, indicating a relative position as the mouse is moved.
FIG. 3 is an example screen shot of TfrmCompCol instances and a TfrmSMF of the
data analysis system. Each component window in FIG. 3 (TfrmCompCol instance)
is responsible for reporting the following events to the TfrmSMF instance,
which will relay them to all the shown windows (including the request
initiator):

- selection of a specific node (shown in FIG. 3 as the black dots on
the component maps) and decoding of the actual map location selected.
(This has to be done by a component map, as individual maps may not
be of the same size. The size of the map that was clicked on is used to
determine the current position within the map. Within code this is known
as drawing a picking circle);

- displaying labels on the maps. The active TfrmCompCol instance is
responsible for obtaining the label string as well as the font, be it the
default or some other font indicated by the user. This information is sent,
along with the relative position on the map where the user right-clicked to
add the label. A relative position is passed to the TfrmSMF, as all
displayed TfrmCompCol instances may again not be of the same
dimension. This will ensure that labels appear at the proper position on
the maps;

- selection of individual nodes, if Selection Mode is active. Each individual
node is sent to the TfrmSMF. It in turn instructs every shown component
map to select the node selected by the calling component map;

- updates of passed neurons if either PathMode_I or PathMode_II is
active. Depending on the active mode, each component map will be
responsible for drawing the necessary visual cues; and

- updating the selected position indicator on a component form's colour
bar (this is only an indicator that the position has to be updated; each
individual window has to calculate exactly where the indicator should be
drawn).

Component maps may show colours that are present on their respective
colour bars. The cluster map may also show a grayed-out colour bar. In FIG. 4,
an example component map colour bar is shown. The process of colouring the
component maps is as follows (this applies to all component colour
maps, as well as the U-Matrix, Frequency and Quantization Error maps):

• After completion of the training process, a 'colour index' associated with
each map position is calculated. This calculation is based on scaling the
final component value in a specific position to an index within the
available colour range. (The available colour range, such as is shown in
FIG. 4, is a set of hard-coded constants. During the development phase
of this part of the system, it was decided that these values were unlikely
to change, and could therefore be hard-coded.) Scaling is done using the
following process:

■ Let w_{xy,i} be the component value of the i'th component of
the neuron at position (x,y) in the final neuron map.

■ Let w_{max,i} and w_{min,i} be the respective component
maximum and minimum values. (Depending on where the
calculation is done, these values may be the actual
maximum and minimum values extracted from the original
training data, or be as simple as the values 0 and 1, which
would be the case if the scaled values used during training
are considered.)

■ It follows that w_{xy,i}, scaled between w_{min,i} and w_{max,i}
and over Cfact, is an index value into the range of available
colours. Cfact is the number of available colours.

• The calculated colour index is used as an index into the set of available
colours, and the colour indicated by the index is then used to draw the
hexagon at the proper position.
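
The colour index calculation described above can be sketched as follows. The function name `colour_index`, the clamping of out-of-range values, and the use of `n_colours - 1` as the top index are assumptions made for illustration; the text only states that the component value is scaled to an index within the available colour range.

```python
def colour_index(value: float, min_value: float, max_value: float,
                 n_colours: int) -> int:
    """Scale a component value to an index into a table of n_colours colours."""
    if max_value == min_value:
        # Degenerate range: every node gets the first colour (assumed behaviour).
        return 0
    # Fraction of the way from the minimum to the maximum, clamped to [0, 1].
    fraction = (value - min_value) / (max_value - min_value)
    fraction = min(max(fraction, 0.0), 1.0)
    # Map the fraction onto the valid indices 0 .. n_colours - 1.
    return int(fraction * (n_colours - 1))
```

With the scaled training values, `min_value` and `max_value` would simply be 0 and 1; with the original training data they would be the extracted minimum and maximum of the component.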



As stated, the above approach is also used for colours used in frequency
and quantization error maps. As these maps usually have a much smaller value
range than needs to be mapped to a visual colour range, differences in node
values are much more accentuated.

In one embodiment, the process of calculating the current position on the
colour map may be used. The process of converting the mouse cursor location
to coordinate values in terms of rows and columns on the map is needed to
perform a large part of the functionality exhibited by the TfrmCompCol
instance. The colour map shown is a bitmap image that is created once before
any maps are shown. This map image is then BitBlt-ed onto the TfrmCompCol
canvas, and redrawn as necessary. This was chosen as the implementation as it
is faster to redraw portions of a precreated bitmap image than to recalculate
and draw hexagons on an ad hoc basis, though either process, and other
suitable methods, are possible.

The process to translate a mouse cursor location into a (row, col) pair of
values that can be used by the application is as follows:

Assume that the following values are available, and are up to date:

• n_rows is the number of rows in the map.

• n_cols is the number of columns in the map.

• I_width is the width of the image reporting mouse moves. (I_width is
therefore the maximum horizontal mouse position that can be reported.)

• I_height is the height of the image reporting mouse moves.

• Radius_horizontal is the horizontal radius. It signifies the radius of a
hexagon, calculated using the current width of the map.

• Radius_vertical is the vertical radius, based on the current height of the
image.

• (x_m, y_m) represents the mouse location.

Note that Radius_vertical and Radius_horizontal need only be recalculated
when the map is resized.

To calculate the actual row and column positions, a reverse order of
operations needs to be executed, based on the formulas for Radius_vertical
and Radius_horizontal, while solving for the row and column. It is done as
follows, by determining possible drawing coordinates based on row and column
calculations:

• for r ∈ [1, n_rows]

o for each column c, calculate the drawing coordinates of the hexagon at
(r, c); the horizontal coordinate is calculated differently depending on
whether r is even or odd

o if the horizontal distance between the drawing coordinates and the mouse
location is within Radius_horizontal, and the vertical distance is within
Radius_vertical, then take r and c as the row and column values of the
current mouse position and break the loop, else continue the loop

In one embodiment, the process of predicting component values based
on a trained SOM is summarised as follows:

• For the vector to be predicted, calculate the best matching unit on the
map.

• Copy the missing values from the best matching unit's vector to the vector
being predicted from.

Other issues also come into play, as data needs to be scaled to within the same
range as is used internally by the map, and then rescaled to be within the
ranges of the domain.
Scaling values to the domain of the map is simple, as all map component
values are within the range [0,1]. Assuming that a_i^{max} and a_i^{min}
respectively represent the maximum and minimum values for component i in the
training domain, a component value a_i can be scaled to the map domain using
the formula:

a_i' = \frac{a_i - a_i^{min}}{a_i^{max} - a_i^{min}}    (22)

By applying this formula to each available component of an input vector, the
vector can be scaled to the map domain. In one embodiment, the vector's best
matching unit is found using equation (1). Found attribute values are then
scaled back to the training domain. (Note that the training domain's maximum
and minimum values are used, as they represent the learned subspace of the
problem.) 'Scaling back' can be done using the formula:

a_i = a_i' \cdot (a_i^{max} - a_i^{min}) + a_i^{min}    (23)
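
As a small worked illustration of scaling a component value to the map domain and back, the following sketch applies the min-max scaling and its inverse. The function names `to_map_domain` and `to_training_domain` are hypothetical, and the sketch assumes that the maximum and minimum of a component differ.

```python
def to_map_domain(value: float, minimum: float, maximum: float) -> float:
    """Scale a training-domain component value into the map range [0, 1]."""
    return (value - minimum) / (maximum - minimum)


def to_training_domain(scaled: float, minimum: float, maximum: float) -> float:
    """Scale a map-domain value in [0, 1] back to the training domain."""
    return scaled * (maximum - minimum) + minimum
```

For a component whose training values range from 10 to 20, a value of 15 maps to 0.5 on the map, and a predicted map value of 0.5 scales back to 15.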



In one embodiment, a predict function is used. The predict function can
be implemented in different ways, based on the number of neurons taken into
consideration when interpolating new values. The process above resembles an
EasyPredict technique, where only the best matching unit of a data pattern is
used to determine possible new values. The predict function can however also
be done by taking a neighbourhood of values into consideration. This implies
that either a set number of neighbourhood nodes can be used for interpolation,
or all nodes within a certain distance of the best matching unit (BMU). In one
embodiment, the latter technique is implemented in the data analysis system.
Note that this calculation only considers nodes in the neighbourhood of the
BMU that lie within the same cluster as the best matching unit.

Nodes may be allowed/disallowed by specifying which components of
the currently selected node should be kept constant. The user may also specify
that a variance on a component value be allowed.

For the actual determination of which nodes may be allowed, each node
in the map is compared to the selected node, and based on the differences
between component values, a decision is made. The complete calculation for
all nodes is described as follows, where c represents the weight vector of the
selected node:

• For each node in the map, n:

o For each component value of the weight vector x_n of unit n:

■ Determine the maximum and minimum component values,
a_i^{max} and a_i^{min} respectively.

■ Retrieve the standard deviation for the current component, σ_i.

■ Calculate the allowed variance for this component, a_i', to be
σ_i if the user chose to allow variance, or 0.0 otherwise.

■ If element i is to be kept constant, and |x_{n,i} - c_i| ≤ a_i', then
node n is to be allowed.
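
The node filtering described above might be sketched as follows. This is a hedged interpretation: the function name `allowed_nodes` and its parameters are hypothetical, and the sketch assumes that a node is allowed only when every component marked as constant lies within the allowed variance of the selected node's value.

```python
def allowed_nodes(nodes, selected, constant, std_dev, allow_variance):
    """Return the indices of the nodes that may be used.

    nodes          -- list of weight vectors, one per map node
    selected       -- weight vector c of the currently selected node
    constant       -- indices of the components that must be kept constant
    std_dev        -- standard deviation of each component
    allow_variance -- True if the user chose to allow variance
    """
    allowed = []
    for index, weights in enumerate(nodes):
        ok = True
        for i in constant:
            # Allowed variance: the standard deviation, or 0.0 if not permitted.
            tolerance = std_dev[i] if allow_variance else 0.0
            if abs(weights[i] - selected[i]) > tolerance:
                ok = False
                break
        if ok:
            allowed.append(index)
    return allowed
```

With no variance allowed, only nodes that match the selected node exactly on the constant components are kept.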
A log file may be used to record the sequence of system events. When
starting the application, the following sequence of events takes place:

=> Application->Initialize() is called.
=> FormCreate is called.
=> Validity of license is checked.
=> Application->Run() is called, which starts processing event triggers in
the WinMain processing loop.

When the application is closed, the following sequence takes place:
=> Close request is processed.
=> FormClose is called.
=> Shutdown date is called, and written to the licensing file.
=> Application is terminated.

In one embodiment, regardless of what the data capturing structure of
the logfile is, at the beginning of the file a reference is kept to the last
time the application was run. This time can easily be synchronized on
application shutdown when the licensing file is updated with last usage
information. This enforces sequential dates when using the data analysis
system. Maintaining the same level of date handling in the logfile allows the
application to cross-reference dates, to pick up any possible tampering with
the logfile. Possibly, track may also be kept (in the Registry/license file) of
the file date of the logfile, to detect tampering/unauthorised changes.
An example logfile structure is:

Log file signature (32-bits)
Last open date (32-bits)
{Entries}

In accordance with one exemplary embodiment, each entry in the logfile has the
format <Key, Value>. This allows the creation of an intelligent lexer and parser
that can read a fairly random file format. This gives the application writer
more freedom to add different kinds of data to the binary structure of the
logfile. In one embodiment, the 'Last Open Date' position in the logfile should
always be written at the same offset. This allows it to be updated and
rewritten to disk when it would not be necessary to change the log data, or
append to it. Using the VCL classes TFileStream and TReader/TWriter, the
fmOpenReadWrite file access mode, which is essentially a wrapper for the
traditional append mode when using the C style primitives defined in
<stdio.h>, will allow the system to dynamically update only a single position
in the file.
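
The fixed-offset update can be illustrated with a short sketch. It is written in Python rather than with the VCL classes named above, and the little-endian byte order, the integer encoding of the date, and the function names are all assumptions; only the two 32-bit header fields come from the example structure.

```python
import struct

# Header layout taken from the example structure: two 32-bit fields.
# Little-endian unsigned integers are an assumption of this sketch.
HEADER = struct.Struct("<II")
LAST_OPEN_OFFSET = 4  # the date follows the 32-bit signature


def create_log(path: str, signature: int, last_open: int) -> None:
    """Write a new log file that contains only the header."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(signature, last_open))


def update_last_open(path: str, last_open: int) -> None:
    """Overwrite only the 'last open date' field, leaving the entries intact."""
    with open(path, "r+b") as f:  # read/write without truncating the file
        f.seek(LAST_OPEN_OFFSET)
        f.write(struct.pack("<I", last_open))


def read_header(path: str) -> tuple:
    """Return (signature, last_open) from the start of the log file."""
    with open(path, "rb") as f:
        return HEADER.unpack(f.read(HEADER.size))
```

Because the date always sits at the same offset, the update touches four bytes and never needs to rewrite or re-parse the entries that follow.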

In one embodiment of the invention, a log is kept of all actions that may
be of value to a user/company of the data analysis system. Relevant actions
include, but are not limited to:

■ Predictions from a file

■ Interactive predictions

■ Printing of maps

One process of maintaining a log file is to define a clear application
program interface (API) that allows calls to write directly to the log file, and
automatically update the log file structure to maintain its integrity as
described above. This will allow single update calls, and different types of
updates can be made using this interface. Maintenance of the log file is then
supposed to be transparent to the developer.

In one embodiment, the data analysis system uses a specified file
format. In order to properly integrate template support into the data analysis
system, the following specifications may be included:

• A template selector. Lists all the templates found in the application
directory (all .rdt files, from which the template names are extracted.
Possibly, if a single template is specified, that template will automatically
be opened). Only templates containing valid license keys (i.e. the same
as that contained in pq.dll) may be selected and opened.

• A template editor (a copy of the one found in the template generator).
Allows editing existing data in the template, as well as adding, deleting
etc. Rudimentary copying and pasting, with type-aware (numeric, string,
discrete values) updates, are supported. Allows saving, and exporting data
in tab-delimited text format, from which a map can be trained.

• License key update monitor. When there is a valid change in a user's
license key (before it expires, using frmLic etc.), this update must be
reflected in each associated template containing the old license key. The
easiest approach would be to assume that all templates are kept in
the application directory. Provision may also be made for the user to
specify locations of templates, if the user decides to keep them in any
location other than the default.

• Creation of an UnsupervisedData instance directly from a template. It is
important to create it directly from the .rdt file, as this will avoid having
unnecessary copies of the same data in memory. For large files, this
becomes an issue. This technique will be useful for files involved in the
following actions:

o Creation of a new knowledge filter
o Time series modelling
o Evaluation of external file statistics
o Prediction from a file (to get the data set bounds, size, etc.)

• Creation of a TStringList that contains TStringList instances as individual
member objects. As the final training file format can be trivially induced
from the template, this will NOT be a problem. It will be necessary in the
following cases:

o Labelling from a template file (.rdt)
o Interactive labelling from a template file (.rdt)

In one embodiment, template usage will be 'always on', i.e., it will be enabled
regardless of the type of license key used. This implies that support will have
to be added to ensure that actions that are typically associated with the full
version of the data analysis system will not be accessible while using the
RapViewer version. At the moment, this implies that, in the viewer:

• Maps may be created from templates, but not from the 'typical' input files
such as .txt and .csv.

• All other actions that may typically be done using text files may still be
carried out.

Menu additions:

• A File menu option that will allow the opening of a template for editing
(a submenu on File, or a completely new subgroup?)

• A new toolbar that supports all of this.

In accordance with one embodiment, the results from the training
process are stored in a knowledge filter file format. The knowledge filter file
format encodes the output from the self-organising map training process, as
well as custom information used to speed up some of our custom functions. In
one exemplary embodiment, the data analysis system writes the file format into
a Microsoft shared storage object format. The following is an
exemplary file format with example comments:

<String> Signature "wmb1|wmb2"
<int32> RowCount
<int32> ColCount
<int32> AttributeCount
// Attribute names
#AttributeCount x <String> #AttributeName

// Attribute values (SOM output)
foreach r in [1, #RowCount]
    foreach c in [1, #ColCount]
        foreach a in [1, #AttributeCount]
            <double> weight value

// Statistics vectors
// Minima
#AttributeCount x <double>
// Maxima
#AttributeCount x <double>
// Averages
#AttributeCount x <double>
// Standard deviations
#AttributeCount x <double>
// Incremental values
#AttributeCount x <double>

// U-matrix information
<double> Minimum value
<double> Maximum value
// U-matrix entry for each node in the knowledge filter
#RowCount x #ColCount x <double>

// Attribute drawing distances
// i.e. colour indexes used to do colour space
// visualisation
#RowCount x #ColCount x #AttributeCount x <int32>

// Node frequency information
<int32> Maximum frequency
#RowCount x #ColCount x <int32>

// Quantisation error information
<double> Minimum value
<double> Maximum value
// Values per node
#RowCount x #ColCount x <double>

// Save clustering information
<int32> Optimal number of clusters
<bool> Flag used during visualisation, do not leave out
#RowCount x #ColCount x <int32>

// Clustering information for the 50 possible
// cluster configurations
#RowCount x #ColCount x 50 x <int32>



wo 2005/006249 






PCT/AU2003/000881 



// Clustering indicators information
<int32> #ClusterIndicatorCount
#ClusterIndicatorCount x <double>

In one embodiment, an equal distance averaging (EDA) technique is
used to perform predictions from a knowledge filter in the data analysis
system. Predictions are made from a knowledge filter by finding a single node
in the knowledge filter that most closely represents a provided data vector.
The process can be summarised, for a knowledge filter K, as:

Find the node vector, n, in the knowledge filter that most closely
represents the input data vector, d, i.e.,

n = \arg\min \lVert K_{<r,c>} - d \rVert, \forall r \in [1, K_R], \forall c \in [1, K_C];    (1)

replace missing entries in d with the corresponding entries from n.

EDA functions in a fashion similar to the general prediction technique
described, but with an alternative to the lookup procedure represented in
equation (1). When finding the best matching node n, it may often be the case
that multiple nodes offer an equal result when calculating the
Euclidean distance. In such a situation, all the equal nodes must be taken
into consideration. The procedure above may then be replaced with the
following approach:

Find n, using equation (1).

Build a list of knowledge filter node values, M, such that for each
element m of M, \lVert m - d \rVert = 0.

If M is empty (i.e. it contains no elements), replace missing entries in d
with the corresponding entries in n. If M is not empty, replace each missing
entry in d with the average value of the corresponding position of all the
elements in M.
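
The following sketch shows one possible reading of the EDA procedure. It is an interpretation, not the system's implementation: it treats "equal" nodes as all nodes tied for the smallest Euclidean distance to d (measured over the known entries only), so the single best matching node is the special case of a one-element tie. The function name `eda_predict`, the flat list of node vectors, and the use of `None` for missing entries are assumptions.

```python
import math


def eda_predict(nodes, d):
    """Fill the missing (None) entries of d from the closest node or nodes.

    nodes -- list of node vectors of the knowledge filter (flattened grid)
    d     -- input vector; missing entries are given as None
    """
    known = [i for i, value in enumerate(d) if value is not None]

    def distance(node):
        # Euclidean distance over the entries that are present in d.
        return math.sqrt(sum((node[i] - d[i]) ** 2 for i in known))

    distances = [distance(node) for node in nodes]
    best = min(distances)
    # All nodes that tie with the best matching node (assumed meaning of "equal").
    tied = [node for node, dist in zip(nodes, distances)
            if math.isclose(dist, best, abs_tol=1e-12)]

    result = list(d)
    for i, value in enumerate(d):
        if value is None:
            # Average the corresponding position over all tied nodes.
            result[i] = sum(node[i] for node in tied) / len(tied)
    return result
```

When two nodes match the known entries equally well, the missing entry becomes the average of their values rather than an arbitrary pick between them.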
In another embodiment, a logical flow of interface and information for
analysis functions is used. In one embodiment, a composite attribute window is
used to visualize multiple attributes present in a knowledge filter (and
consequently in any dataset) on a single plane. The function is made available
through a context menu on the attribute window. The context menu option is
only available on attributes that have a default value range [0,1]. (This
limitation is not necessary, but may be used to force the user to adhere to
conformance in attribute value ranges.)

When a number of attributes are visualized on the same attribute
window, a new attribute image is created that displays the union of the
selected set of attributes. It is done by constructing a new attribute matrix,
and selecting, for each position, the highest value from the chosen set of
attributes. An example composite filter showing the concurrent visualization
of multiple attributes is shown in FIG. 5. A composite filter in a binary
attribute window is shown in FIG. 6. The process can be summarised as follows:

For each attribute, a matrix C_x is defined, where x is defined as the index
of an attribute. A specific row and column value for an attribute is
represented by C_x[r,c]. In the composite attribute window, the graph
drawing algorithm (discussed above) finds the highest (and consequently
the required) value by calculating:

C_c[r,c] = max {C_a1[r,c], ..., C_an[r,c]}.

The range {a1:an} defines the selected set of attributes, and [r,c] can be any
row and column position in the knowledge filter's valid range.
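
A minimal sketch of the composite calculation follows, assuming each attribute matrix is a list of rows of equal size; the function name `composite_attribute` is hypothetical.

```python
def composite_attribute(matrices):
    """Combine attribute matrices by taking the highest value at each position.

    matrices -- list of attribute matrices (lists of rows) of identical shape
    """
    rows = len(matrices[0])
    cols = len(matrices[0][0])
    return [[max(matrix[r][c] for matrix in matrices) for c in range(cols)]
            for r in range(rows)]
```

For binary attributes this amounts to a logical OR of the selected attributes, which is why the result can be read as their union.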

In another embodiment, a range filter allows a user to select regions on a
knowledge filter by filtering out nodes based on explicitly defined value
ranges. An example range filter interface screen shot is shown in FIG. 7. As
shown, a value range can be defined per attribute. Value ranges can also be
specified by visually selecting a required range, by clicking on the colour
bar next to each of the attribute values. When a user selects the
corresponding button, the selection active on the knowledge filter is updated
to reflect all the nodes that adhere to the specified value range criteria.

■ The procedure executed to perform this function is as follows:
Construct two input vectors, d^{min} and d^{max}, respectively containing the
minimum and maximum values specified in the visual interface shown above;
Clear any active selection on the knowledge filter;
Any node position <r,c> is added to the active selection if, \forall i \in K,
d_i^{min} \le K_{<r,c>,i} \le d_i^{max}, for all the non-empty entries in
d^{min} and d^{max};
The new selection is drawn on all open attribute windows.
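
The range filter procedure can be sketched as below. The function name `range_filter`, the nested-list layout of the knowledge filter, the zero-based (row, column) positions, and the use of `None` for empty entries in the two bound vectors are assumptions for illustration.

```python
def range_filter(knowledge_filter, d_min, d_max):
    """Return the set of (row, col) positions whose node satisfies every bound.

    knowledge_filter -- grid (list of rows) of node vectors
    d_min, d_max     -- per-attribute bounds; None marks an empty entry
    """
    selection = set()  # start from a cleared selection
    for r, row in enumerate(knowledge_filter):
        for c, node in enumerate(row):
            inside = all(
                (low is None or node[i] >= low) and
                (high is None or node[i] <= high)
                for i, (low, high) in enumerate(zip(d_min, d_max))
            )
            if inside:
                selection.add((r, c))
    return selection
```

Attributes without a specified bound are simply ignored, so a filter with no bounds selects every node.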



In another embodiment, a visual scaling function is used. Visually
scaling an attribute window representation allows a user to interactively
change the colour scale used to represent an attribute in a knowledge filter.
It is done by changing the minimum and maximum values used to calculate the
colour progression used to visualize an attribute. A changed minimum or
maximum value results in a re-interpolation of the active colour ranges over
the new valid 'range' of attribute values. In FIG. 8, the image shows
visualization of an attribute that contains outlier data. A single red node
shows where the outlier was placed during training, with the rest of the
attribute appearing to be values all close to 0.0. However, when scaling is
applied, in FIG. 9, the more convoluted attribute visualization space is
identified. This function thus allows the user to easily determine what the
true nature of an attribute is relative to the apparent visualized space.

Normally, the colour used to visualize a particular node on an attribute
window is found by using the minimum and maximum values found during the
training phase of the knowledge filter. Thus the colour index is found by
calculating: index_position = (K_{<r,c>,i} - i_min) / (i_max - i_min).
index_position is thus assigned a value in the range used to retrieve an
actual colour value from a lookup table.

When creating a scaled attribute image, the calculation is adapted to
reflect the new minimum and maximum values. s_min and s_max are assigned the
minimum and maximum values represented by the scaling triangles in the visual
interface. We also again explicitly define i_min and i_max as the minimum and
maximum values found for attribute i during training. The calculation of
index_position is then redefined by the following steps:

If (K_{<r,c>,i} > s_max), then K_{<r,c>,i} = s_max;

index_position = (K_{<r,c>,i} - s_min) / (s_max - s_min).

One exemplary labelling process in the data analysis system places
labels on a knowledge filter by:

linking attribute columns in the input file to attributes in the knowledge
filter;

selecting which attributes from the input file are to be used for labelling;
determining with which row and column each row in the input file is
associated; and

placing the actual labels on the knowledge filter.

In one embodiment, interactive labelling brings some more intelligence to
the process. It allows the user to conditionally place labels on a knowledge
filter, based on attribute value specifications entered interactively. The
process is approximately the same as for normal labels (as described above)
but follows the following logic:

attribute columns from the input data are linked to specific attributes in a
knowledge filter;

a window is displayed that allows the user to select a search column, and
a set of labelling attributes;

the user can then enter a conditional statement that is used to filter
through all the rows in the input data source, and only those rows that adhere
to the specified condition are extracted; and

the user can choose whether the results are to be placed on the
knowledge filter in the form of labels.

In a Search on Attribute section the user may select the attribute upon
which a condition is to be tested. Next, the user specifies the "tests" for
when to apply labels by entering a value next to Search for and a Condition
that the selected attribute must satisfy. The following table explains how the
various condition tests operate:

=     Finds attributes that exactly match the value specified.

>     Finds attributes that are greater than the specified value.

>=    Finds attributes that are greater than or equal to the specified value.

<     Finds attributes that are less than the specified value.

<=    Finds attributes that are less than or equal to the specified value.

....  Searches the specific attribute value for a partial string match based on
      the entered value. In the above window, selecting this search condition,
      specifying 'Surname' as search attribute and entering 'K' in the Search
      for text box, will return all surnames starting with a 'K'. Entering more
      characters will refine this search until no more entries in the data file
      match the specified search criteria. If a numeric attribute is chosen as
      the search attribute, a textual representation of the actual value is
      used to match the record using this condition operator.

In a Label using Attributes section the user may select which attributes
are applied as labels from the records from the input data file that pass the
test that was set up. More than one attribute may be specified. Labels will be
constructed by using actual values from the input file, separated by commas
when more than one attribute has been specified.
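
The condition tests and label construction described above might look like the following sketch. The names `CONDITIONS` and `build_labels`, the representation of records as dictionaries, the ", " separator, and the reading of the partial match as a prefix match (as in the 'K' surname example) are assumptions.

```python
import operator

# One test per condition operator; "...." is the partial string match,
# applied to the textual representation of the value.
CONDITIONS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "....": lambda value, target: str(value).startswith(str(target)),
}


def build_labels(records, search_attribute, condition, search_for,
                 label_attributes):
    """Return one label per record that passes the condition test."""
    test = CONDITIONS[condition]
    labels = []
    for record in records:
        if test(record[search_attribute], search_for):
            # Join the actual values of the chosen attributes with commas.
            labels.append(", ".join(str(record[name])
                                    for name in label_attributes))
    return labels
```

Placing the resulting labels on the knowledge filter would still require matching each passing record to its node, which this sketch leaves out.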

In another embodiment, an advanced search function offers functionality 
similar to that described above for the interactive labelling function. Where 
interactive labelling is driven by a set of entered labels and a single criteria, 
advanced search is driven by a set of criteria specified by making a selection on 
15 a l<nowledge filter. The data analysis system then interactively updates the 
resulting set of rows read from a data source. 

The process can be summarised in the following steps: 



20 



a set of data records are read from a data source; 

attribute columns from the dataset is matched to attribute in the 

Icnowledge filter; and 
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a results window is displayed which lists all the records that are 
associated with nodes that are part of the active selection on a 
knowledge filter. 

5 As the selection on the knowledge filter changes, the set of displayed data 
records are updated to reflect the new selection. 

The complete process followed when using the advanced search 
function, can be summarised as follows: 

1 . The user specifies a selection on the knowledge filter. A selection here 
10 is a set of nodes, N , where each element in N is a set of row column positions. 

2. The user specifies an input file. 

3. RapAnalyst reads the file into memory, noting all the column names. 

4. RapAnalyst matches columns with names equal to those in the 
knowledge filter, to ttie corresponding attributes. 

15 5. The user is allowed to visually specify which column from the input file 

is linked that which attribute in the knowledge filter. 

6. For each record d in the file read into memory in step 3, the following
steps are performed:

7. The list of nodes, M, returned by the EDA prediction function (see
above) is calculated.

8. From M, the set of row-column positions of each element in M is
extracted into the list M'.

9. The overlap between N and M' is calculated, i.e. M" = N ∩ M'.

10. If M" is non-empty, a text entry corresponding to d is inserted into
the advanced search window.
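
By way of illustration only, steps 6 to 10 above may be sketched in Python as follows. The names `advanced_search`, `predict_nodes`, `nearest_node` and `GRID` are hypothetical and do not appear in the described system. In particular, `nearest_node` is a toy stand-in for the EDA prediction function, which is described elsewhere in this specification and is not reproduced here.

```python
from typing import Callable, Dict, List, Set, Tuple

Position = Tuple[int, int]   # (row, column) position of a node on the knowledge filter
Record = Dict[str, float]


def advanced_search(
    records: List[Record],
    selection: Set[Position],
    predict_nodes: Callable[[Record], List[Position]],
) -> List[Record]:
    """Return the records whose predicted nodes overlap the selection N."""
    hits = []
    for d in records:                        # step 6: visit every record d
        m_prime = set(predict_nodes(d))      # steps 7-8: positions of the nodes in M
        overlap = selection & m_prime        # step 9: M" = N intersected with M'
        if overlap:                          # step 10: keep d when M" is non-empty
            hits.append(d)
    return hits


# Hypothetical stand-in for the prediction function: the single nearest node
# on a 2x2 grid whose nodes each hold one weight for the attribute "x".
GRID = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 3.0}


def nearest_node(record: Record) -> List[Position]:
    best = min(GRID, key=lambda pos: abs(GRID[pos] - record["x"]))
    return [best]


records = [{"x": 0.1}, {"x": 1.9}, {"x": 3.2}]
selected = {(1, 0), (1, 1)}                  # the active selection N
result = advanced_search(records, selected, nearest_node)
```

When the selection changes, calling `advanced_search` again with the new set of positions corresponds to repeating the process from step 6.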

A results window is displayed which lists all the records that are
associated with nodes that are part of the active selection on a knowledge filter.

As the selection on the knowledge filter changes, the set of displayed
data records is updated to reflect the new selection. This is done by repeating
the process described above from step 6.

In accordance with another embodiment, a zooming function may be 
used. Zooming into a knowledge filter allows a user to simulate drill-down data 
analysis with a knowledge filter as its basis. Zooming is best understood when
considering the process associated with it. Zooming is performed in the
following way:

an initial knowledge filter is constructed;
a user can then make a selection of nodes that will form a base reference
of interest;
the user then defines a set of data records from an external data source;
the set of records is then matched to the knowledge filter; and
all records that are linked to the matched region are flagged and written to
a temporary file.

The temporary file may then be used to construct a new knowledge filter. 

The zooming process thus effectively allows the user to focus on a
specific region within an existing knowledge filter and perform training only on
records that are associated with this region.
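
As a minimal sketch of the record-flagging part of zooming, assuming that each record is matched to exactly one node, the following Python writes the records linked to the selected region to a temporary CSV file. The names `zoom`, `match_node` and `toy_match` are hypothetical; `toy_match` merely stands in for whatever matching the knowledge filter performs.

```python
import csv
import tempfile
from typing import Callable, Dict, Iterable, Set, Tuple

Position = Tuple[int, int]   # (row, column) position of a node


def zoom(
    records: Iterable[Dict[str, float]],
    selection: Set[Position],
    match_node: Callable[[Dict[str, float]], Position],
) -> str:
    """Write the records matched to the selected region to a temporary CSV file."""
    flagged = [r for r in records if match_node(r) in selection]
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", newline="", delete=False
    ) as fh:
        if flagged:
            writer = csv.DictWriter(fh, fieldnames=list(flagged[0].keys()))
            writer.writeheader()
            writer.writerows(flagged)
        return fh.name                       # path of the temporary file


# Hypothetical matching rule, used only to make the sketch runnable.
def toy_match(record: Dict[str, float]) -> Position:
    return (0, 0) if record["x"] < 1.0 else (1, 1)


path = zoom([{"x": 0.2}, {"x": 1.5}, {"x": 2.5}], {(1, 1)}, toy_match)
with open(path, newline="") as fh:
    zoomed_rows = list(csv.DictReader(fh))   # values are read back as strings
```

The file at `path` would then serve as the input for constructing the new knowledge filter.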

In accordance with another embodiment, a Binarisation process may be 
used. Binarisation is a pre-processing function that automates converting any 
attribute (even non-numeric and non-scalar attributes) into one or more toggled
attribute values. Often, columns containing class information are provided in a
text format. Converting this information into a usable format requires manually
replacing values with their corresponding numerical representation. For more
than two classes and depending on the pre-processing needs, it may also be 
necessary to add new columns containing class information. Binarisation 
automates this process. 

In a hypothetical pre-processing task, an attribute column A may contain
three classes, A1, A2, and A3. Binarisation will create three new columns in the
input data called A1, A2 and A3. New attribute values are determined based on
the value present in the original data, with the simple rule that if an entry for A is
equal to A1, a value of '1' will be placed in the corresponding new column. For the
remaining columns, zeroes are inserted. FIG. 10 is an example illustration of
the binarisation process, in accordance with an embodiment of the present
invention. Initial Table 1000 may be converted to Table 1010 using the
binarisation process. Table 1010 is a numerical representation of class
information that can be used to train the knowledge filter creation algorithm.
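
The binarisation rule described above may be sketched in Python as follows. This is an illustrative sketch only: the function name `binarise` and the sample rows are hypothetical, and dropping the original column A from the output is an assumption, since the text states only that new columns are created.

```python
from typing import Dict, List


def binarise(rows: List[Dict], attribute: str) -> List[Dict]:
    """Replace a class-valued column with one 0/1 column per class."""
    classes = sorted({row[attribute] for row in rows})
    output = []
    for row in rows:
        # Assumption: the original text column is dropped from the output.
        new_row = {k: v for k, v in row.items() if k != attribute}
        for cls in classes:
            # 1 in the column matching the entry, zeroes in the remaining columns
            new_row[cls] = 1 if row[attribute] == cls else 0
        output.append(new_row)
    return output


# Made-up sample rows, for illustration only.
table = [{"id": 1, "A": "A1"}, {"id": 2, "A": "A3"}, {"id": 3, "A": "A2"}]
binary_table = binarise(table, "A")
```

Each output row carries exactly one 1 among the new columns A1, A2 and A3.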

FIG. 11 is a block diagram of an exemplary architecture for a general
purpose computer suitable for operating the data analysis system. The 
illustrated general purpose computer may also be suitable for running 
applications. A microprocessor 1100, including a central processing unit
(CPU) 1105, a memory cache 1110, and a bus interface 1115, is operatively
coupled via a system bus 1180 to a main memory 1120 and an Input/Output
(I/O) control unit 1175. The I/O interface control unit 1175 is operatively
coupled via an I/O local bus 1170 to a disk storage controller 1145, a video
controller 1150, a keyboard controller 1155, a network controller 1160, and I/O
expansion slots 1165. The disk storage controller 1145 is operatively coupled
to the disk storage device 1125. The video controller 1150 is operatively coupled to
the video monitor 1130. The keyboard controller 1155 is operatively coupled to
the keyboard 1135. The network controller 1160 is operatively coupled to the
communications device 1140. The communications device 1140 is adapted to
allow the network inventory adapter operating on the general purpose computer
to communicate with a communications network, such as the Internet, a Local
Area Network (LAN), a Wide Area Network (WAN), a virtual private network, or
a middleware bus, or with other software objects over the communications
network, if necessary.

Computer program instructions for implementing the data analysis
system may be stored on the disk storage device 1125 until the processor 1100
retrieves the computer program instructions, either in full or in part, and stores
them in the main memory 1120. The processor 1100 then executes the
computer program instructions stored in the main memory 1120 to implement
the features of the network inventory adapter. The program instructions may be
executed with a multiprocessor computer having more than one processor.

The general purpose computer illustrated in FIG. 11 is an example of
one device suitable for performing the various functions of the data analysis
system. The data analysis system, and any other associated applications,
components, and operations, may also run on a plurality of computers, a network
server, or other suitable computers and devices.

Other variations may be incorporated into the data analysis system. For 
example, in one embodiment, the data being analysed may come from more 
than one source and amalgamation may be used. In such a situation, unique
identifiers may be used as key attributes in each data source that link the
records together.
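
A minimal Python sketch of such amalgamation is given below, assuming two data sources that share a key attribute. The function name `amalgamate`, the attribute `customer_id` and the sample values are hypothetical. Keeping only records present in both sources is likewise an assumption, not something the text specifies.

```python
from typing import Dict, List


def amalgamate(source_a: List[Dict], source_b: List[Dict], key: str) -> List[Dict]:
    """Link records from two sources that share the same unique identifier."""
    index_b = {row[key]: row for row in source_b}
    merged = []
    for row in source_a:
        other = index_b.get(row[key])
        if other is not None:                # assumption: unmatched records are skipped
            merged.append({**row, **other})
    return merged


# Made-up sample data, for illustration only.
customers = [{"customer_id": 1, "age": 34}, {"customer_id": 2, "age": 51}]
accounts = [{"customer_id": 2, "balance": 120.0}, {"customer_id": 1, "balance": 75.5}]
combined = amalgamate(customers, accounts, "customer_id")
```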

In another embodiment, data summarisation may be used to reduce
many-to-one relationships into single records. For example, a single customer
may have many transactions. Each transaction is a single record. However, the
analysis, in one embodiment, focuses on customers (or certain
fixed actors in the data set), not transactions. In such a situation, the many
transaction records may be summarised into one customer record by
calculating, for each customer, certain record attributes such as, for example,
the number of transactions, the total value of all transactions, the time since first
transaction, the time since last transaction, the average value of all
transactions, the average time between transactions, the number of
transactions made during business hours, and any other suitable entries.
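
The summarisation described above may be sketched in Python as follows, computing a subset of the listed attributes. The function name `summarise`, the field names `customer`, `value` and `time`, and the sample transactions are hypothetical. Times are plain numbers in arbitrary units for simplicity.

```python
from collections import defaultdict
from typing import Dict, List


def summarise(transactions: List[Dict], now: float) -> Dict[str, Dict]:
    """Collapse many transaction records into one summary record per customer."""
    by_customer = defaultdict(list)
    for t in transactions:
        by_customer[t["customer"]].append(t)

    summary = {}
    for customer, items in by_customer.items():
        values = [t["value"] for t in items]
        times = [t["time"] for t in items]
        summary[customer] = {
            "count": len(items),                      # number of transactions
            "total_value": sum(values),               # total value of all transactions
            "average_value": sum(values) / len(values),
            "time_since_first": now - min(times),
            "time_since_last": now - max(times),
        }
    return summary


# Made-up sample data, for illustration only.
transactions = [
    {"customer": "c1", "value": 10.0, "time": 1},
    {"customer": "c1", "value": 30.0, "time": 5},
    {"customer": "c2", "value": 8.0, "time": 4},
]
customer_records = summarise(transactions, now=10)
```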

In another embodiment, data manipulation may be used to reduce
temporal sequences into single records. When subsequent records of data
represent readings of values over time and the nature of the progression
through time is an important aspect of the investigation, data manipulation is
used. The data analysis system effectively considers each data record
independently, i.e. the order of the records is not considered. A typical
translation performed to combine many rows of data into one row of data
containing many different timescales may be as follows:

[Table: Original data, with columns Time (t), A and B]

The first row captures readings of A and B at time 2, 1 and 0 and the
second row encapsulates time 3, 2 and 1. An alternative approach may
calculate new attributes as average percentage change from recent previous
records in order to capture the nature of the progression over time in a single
record.
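
One possible form of this translation is sketched below in Python, assuming equally spaced readings and a window of the current reading plus two earlier ones. The function name `flatten_windows`, the column naming scheme and the sample values are hypothetical and are not taken from the original tables.

```python
from typing import Dict, List


def flatten_windows(rows: List[Dict], attributes: List[str], lags: int = 2) -> List[Dict]:
    """Combine each reading with its `lags` predecessors into a single record."""
    rows = sorted(rows, key=lambda r: r["t"])    # assumes consecutive time steps
    flattened = []
    for i in range(lags, len(rows)):
        record = {"t": rows[i]["t"]}
        for lag in range(lags + 1):
            for name in attributes:
                suffix = "(t)" if lag == 0 else f"(t-{lag})"
                record[name + suffix] = rows[i - lag][name]
        flattened.append(record)
    return flattened


# Made-up readings, for illustration only.
series = [
    {"t": 0, "A": 1.0, "B": 10.0},
    {"t": 1, "A": 2.0, "B": 20.0},
    {"t": 2, "A": 4.0, "B": 40.0},
    {"t": 3, "A": 8.0, "B": 80.0},
]
windows = flatten_windows(series, ["A", "B"])
```

With four readings, the first output record covers times 2, 1 and 0 and the second covers times 3, 2 and 1, matching the description above.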

In yet another embodiment of the invention, attributes used to train the
data analysis system are considered to be scalar variables. Scalar variables
are those where the value is measured according to some scale. Temperature,
speed and percentage are all examples of scalar variables. Similarly, a survey
with 3 possible responses where 1 is disagree, 2 is indifferent and 3 is agree
would be considered a scalar variable, because conceptually 2 belongs
between 1 and 3, i.e. in this case 2 is better than 1 but not as good as 3.
Accordingly, in this embodiment, data may be converted from binary and
non-scalar data to scalar data for analysis.

The previous description of the exemplary embodiments is provided to
enable any person skilled in the art to make or use the present invention. While
the invention has been described with respect to particular illustrated
embodiments, various modifications to these embodiments will readily be
apparent to those skilled in the art, and the generic principles defined herein
may be applied to other embodiments without departing from the spirit or scope
of the invention. It is therefore desired that the present embodiments be
considered in all respects as illustrative and not restrictive. Accordingly, the
present invention is not intended to be limited to the embodiments described
above but is to be accorded the widest scope consistent with the principles and
novel features disclosed herein.



